A Policy Gradient Algorithm for the Risk-Sensitive Exponential Cost MDP

Mehrdad Moharrami (Corresponding Author)
https://orcid.org/0000-0003-3907-8406
Computer Science Department, University of Iowa, Iowa City, Iowa 52242

Yashaswini Murthy
https://orcid.org/0000-0002-8788-6873
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801; and Department of Electrical & Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

Arghyadip Roy
https://orcid.org/0000-0001-9955-9514
Mehta Family School of Data Science and Artificial Intelligence, Indian Institute of Technology Guwahati, Guwahati, Assam 781039, India

R. Srikant
https://orcid.org/0000-0003-1483-5204
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801; and Department of Electrical & Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

Published Online: 11 Mar 2024
https://doi.org/10.1287/moor.2022.0139

Mathematics of Operations Research

Volume 50, Issue 1

February 2025

Pages 1-781 C2

Article Information

Received: May 17, 2022
Accepted: January 01, 2024
Published Online: March 11, 2024

Cite as

Mehrdad Moharrami, Yashaswini Murthy, Arghyadip Roy, R. Srikant (2024) A Policy Gradient Algorithm for the Risk-Sensitive Exponential Cost MDP. Mathematics of Operations Research 50(1):431-458.

https://doi.org/10.1287/moor.2022.0139

Acknowledgments

The work presented here was supported in part by NSF grants CCF 19-34986, CNS 21-06801, and CCF 17-04970; ARO grant W911NF-19-1-0379; and ONR grant N00014-19-1-2566.
