On the Convergence of Modified Policy Iteration in Risk-Sensitive Exponential Cost Markov Decision Processes

Yashaswini Murthy (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-8788-6873
Computing and Mathematical Sciences, California Institute of Technology, Pasadena, California 91125

Mehrdad Moharrami
[email protected]
https://orcid.org/0000-0003-3907-8406
Computer Science, University of Iowa, Iowa City, Iowa 52242

Rayadurgam Srikant
[email protected]
https://orcid.org/0000-0003-1483-5204
Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois Urbana-Champaign, Champaign, Illinois 61820

Published Online: 27 Nov 2025
https://doi.org/10.1287/opre.2024.0818

Volume 74, Issue 3

May-June 2026

Pages v-x, 1153-1728, iii-iv

Information

Received: February 16, 2024
Accepted: September 30, 2025
Published Online: November 27, 2025

Cite as

Yashaswini Murthy, Mehrdad Moharrami, Rayadurgam Srikant (2025) On the Convergence of Modified Policy Iteration in Risk-Sensitive Exponential Cost Markov Decision Processes. Operations Research 74(3):1425-1436.

https://doi.org/10.1287/opre.2024.0818

Acknowledgments

The authors thank the anonymous reviewers, associate editor, and area editor for their helpful feedback in refining the manuscript.
