On the Convergence of Modified Policy Iteration in Risk-Sensitive Exponential Cost Markov Decision Processes

Published Online: https://doi.org/10.1287/opre.2024.0818

References

  • Başar T, Bernhard P (2008) H-Infinity Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach (Springer Science & Business Media, New York).
  • Basu A, Bhattacharyya T, Borkar VS (2008) A learning algorithm for risk-sensitive cost. Math. Oper. Res. 33(4):880–898.
  • Bertsekas D (2012a) Dynamic Programming and Optimal Control: Volume I (Athena Scientific, Belmont, MA).
  • Bertsekas D (2012b) Dynamic Programming and Optimal Control: Volume II; Approximate Dynamic Programming (Athena Scientific, Belmont, MA).
  • Bielecki T, Hernández-Hernández D, Pliska SR (1999) Risk sensitive control of finite state Markov chains in discrete time, with applications to portfolio management. Math. Methods Oper. Res. 50(2):167–188.
  • Borkar VS (2001) A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems Control Lett. 44(5):339–346.
  • Borkar VS (2002) Q-learning for risk-sensitive control. Math. Oper. Res. 27(2):294–311.
  • Borkar VS (2010) Learning algorithms for risk-sensitive control. Proc. 19th Internat. Sympos. Math. Theory Networks Systems (Budapest), vol. 5.
  • Borkar VS, Meyn SP (2002) Risk-sensitive optimal control for Markov decision processes with monotone cost. Math. Oper. Res. 27(1):192–209.
  • Cavazos-Cadena R, Montes-de Oca R (2003) The value iteration algorithm in risk-sensitive average Markov decision chains with finite state space. Math. Oper. Res. 28(4):752–776.
  • Chen Z, Yu P, Haskell WB (2019) Distributionally robust optimization for sequential decision-making. Optimization 68(12):2397–2426.
  • Clement JG, Kroer C (2021) First-order methods for Wasserstein distributionally robust MDP. Internat. Conf. Machine Learn. (PMLR, New York), 2010–2019.
  • Donsker MD, Varadhan SS (1975) On a variational formula for the principal eigenvalue for operators with maximum principle. Proc. Natl. Acad. Sci. USA 72(3):780–783.
  • Dullerud GE, Paganini F (2013) A Course in Robust Control Theory: A Convex Approach, vol. 36 (Springer Science & Business Media, New York).
  • Efroni Y, Dalal G, Scherrer B, Mannor S (2018) Beyond the one-step greedy approach in reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1387–1396.
  • Fei Y, Yang Z, Chen Y, Wang Z, Xie Q (2020) Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 22384–22395.
  • Goyal V, Grand-Clément J (2022) Robust Markov decision process: Beyond rectangularity. Math. Oper. Res. 47(3):1772–1800.
  • Hai JL, Petrik M, Ghavamzadeh M, Russel R (2023) RASR: Risk-averse soft-robust MDPs with EVaR and entropic risk. Proc. 26th Internat. Conf. Artificial Intelligence Statist. (AISTATS 2023), vol. 206 (PMLR, New York), 10022–10059.
  • Iyengar GN (2005) Robust dynamic programming. Math. Oper. Res. 30(2):257–280.
  • Mannor S, Mebel O, Xu H (2016) Robust MDPs with k-rectangular uncertainty. Math. Oper. Res. 41(4):1484–1509.
  • Moharrami M, Murthy Y, Roy A, Srikant R (2024) A policy gradient algorithm for the risk-sensitive exponential cost MDP. Math. Oper. Res. 50(1):431–458.
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).
  • Smyth M (2002) A spectral theoretic proof of Perron-Frobenius. Math. Proc. Roy. Irish Acad. (JSTOR), 29–35.
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  • Van der Wal J (1980) Successive approximations for average reward Markov games. Internat. J. Game Theory 9(1):13–24.
  • Whittle P (1990) Risk-Sensitive Optimal Control, vol. 2 (Wiley, Chichester, UK).
  • Winnicki A, Srikant R (2023) On the convergence of policy iteration-based reinforcement learning with Monte Carlo policy evaluation. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 9852–9878.
  • Winnicki A, Lubars J, Livesay M, Srikant R (2021) The role of lookahead and approximate policy evaluation in policy iteration with linear value function approximation. Preprint, submitted September 28, https://arxiv.org/abs/2109.13419.
  • Xu H, Mannor S (2010) Distributionally robust Markov decision processes. Adv. Neural Inform. Processing Systems, vol. 23 (Curran Associates Inc., Red Hook, NY).
  • Yang I (2017) A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance. IEEE Control Systems Lett. 1(1):164–169.
  • Zhou K, Doyle JC (1998) Essentials of Robust Control, vol. 104 (Prentice Hall, Upper Saddle River, NJ).