A Policy Gradient Algorithm for the Risk-Sensitive Exponential Cost MDP

Published Online: https://doi.org/10.1287/moor.2022.0139

References

  • [1] Anantharam V, Borkar VS (2017) A variational formula for risk-sensitive reward. SIAM J. Control Optim. 55(2):961–988.
  • [2] Andrieu C, Moulines E, Priouret P (2005) Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. 44(1):283–312.
  • [3] Balaji S, Meyn S (2000) Multiplicative ergodicity and large deviations for an irreducible Markov chain. Stochastic Processes Their Appl. 90(1):123–144.
  • [4] Basu A, Bhattacharyya T, Borkar VS (2008) A learning algorithm for risk-sensitive cost. Math. Oper. Res. 33(4):880–898.
  • [5] Bellemare MG, Dabney W, Munos R (2017) A distributional perspective on reinforcement learning. Internat. Conf. Machine Learning (JMLR.org, Sydney, NSW), 449–458.
  • [6] Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).
  • [7] Borkar VS (2001) A sensitivity formula for risk-sensitive cost and the actor-critic algorithm. Systems Control Lett. 44(5):339–346.
  • [8] Borkar VS (2002) Q-learning for risk-sensitive control. Math. Oper. Res. 27(2):294–311.
  • [9] Borkar VS (2010) Learning algorithms for risk-sensitive control. Proc. 19th Internat. Sympos. Math. Theory Networks Systems (MTNS), vol. 5, 1327–1332.
  • [10] Borkar VS, Meyn SP (2002) Risk-sensitive optimal control for Markov decision processes with monotone cost. Math. Oper. Res. 27(1):192–209.
  • [11] Chen HF (2001) Convergence and applications of stochastic approximation with state-dependent noise. Proc. 2001 American Control Conf., vol. 2 (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 744–749.
  • [12] Chen HF (2002) Stochastic approximation algorithms with expanding truncations. IFAC Proc. Volumes 35(1):403–408.
  • [13] Chow Y, Ghavamzadeh M (2014) Algorithms for CVaR optimization in MDPs. Preprint, submitted June 12, https://arxiv.org/abs/1406.3339.
  • [14] Chow Y, Ghavamzadeh M, Janson L, Pavone M (2017) Risk-constrained reinforcement learning with percentile risk criteria. J. Machine Learning Res. 18(1):6070–6120.
  • [15] Chow Y, Tamar A, Mannor S, Pavone M (2015) Risk-sensitive and robust decision-making: A CVaR optimization approach. Preprint, submitted June 6, https://arxiv.org/abs/1506.02188.
  • [16] Dai Pra P, Meneghini L, Runggaldier WJ (1996) Connections between stochastic control and dynamic games. Math. Control Signals Systems 9(4):303–326.
  • [17] Föllmer H, Knispel T (2011) Entropic risk measures: Coherence vs. convexity, model ambiguity and robust large deviations. Stochastics Dynam. 11(2–3):333–351.
  • [18] Föllmer H, Schied A (2008) Stochastic Finance (De Gruyter, Berlin, Boston).
  • [19] Karmakar P, Bhatnagar S (2021) On tight bounds for function approximation error in risk-sensitive reinforcement learning. Systems Control Lett. 150:104899.
  • [20] Kontoyiannis I, Meyn SP (2003) Spectral theory and limit theorems for geometrically ergodic Markov processes. Ann. Appl. Probab. 13(1):304–362.
  • [21] Lei J, Chen HF (2020) Distributed stochastic approximation algorithm with expanding truncations. IEEE Trans. Automatic Control 65(2):664–679.
  • [22] Marbach P, Tsitsiklis J (2001) Simulation-based optimization of Markov reward processes. IEEE Trans. Automatic Control 46(2):191–209.
  • [23] Moharami M (2023) Risk-sensitive policy gradient algorithm: Code and implementation details. https://github.com/mmoharami/Policy-Gradient-Risk-Sensitive-Library.
  • [24] Osogami T (2012) Robustness and risk-sensitivity in Markov decision processes. Adv. Neural Inform. Processing Systems, vol. 25 (Curran Associates, Inc., Red Hook, NY), 233–241.
  • [25] Prashanth LA, Fu M (2018) Risk-sensitive reinforcement learning: A constrained optimization viewpoint. Preprint, submitted October 22, https://arxiv.org/abs/1810.09126.
  • [26] Prashanth L, Ghavamzadeh M (2013) Actor-critic algorithms for risk-sensitive MDPs. Adv. Neural Inform. Processing Systems 26:252–260.
  • [27] Prashanth L, Ghavamzadeh M (2016) Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning 105(3):367–417.
  • [28] Rockafellar RT, Uryasev S (2002) Conditional value-at-risk for general loss distributions. J. Banking Finance 26(7):1443–1471.
  • [29] Singh R, Zhang Q, Chen Y (2020) Improving robustness via risk averse distributional reinforcement learning. Proc. 2nd Conf. Learning Dynamics Control, vol. 120 (PMLR, New York), 958–968.
  • [30] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
  • [31] Tadić V (1998) Stochastic approximation with random truncations, state-dependent noise and discontinuous dynamics. Stochastics Stochastic Rep. 64(3–4):283–326.
  • [32] Tamar A, Di Castro D, Mannor S (2013) Temporal difference methods for the variance of the reward to go. Proc. 30th Internat. Conf. Machine Learning, vol. 28 (PMLR, New York), 495–503.
  • [33] Tamar A, Di Castro D, Mannor S (2016) Learning the variance of the reward-to-go. J. Machine Learning Res. 17(1):361–396.
  • [34] Tamar A, Glassner Y, Mannor S (2015) Optimizing the CVaR via sampling. Proc. AAAI Conf. Artificial Intelligence, vol. 29 (Association for the Advancement of Artificial Intelligence, Washington, DC).
  • [35] Whittle P (1982) Optimization Over Time (John Wiley & Sons, Inc., New York).
  • [36] Whittle P (1990) Risk-Sensitive Optimal Control, vol. 2 (Wiley, New York).
  • [37] Zhang K, Zhang X, Hu B, Basar T (2021) Derivative-free policy optimization for linear risk-sensitive and robust control design: Implicit regularization and sample complexity. Adv. Neural Inform. Processing Systems 34:2949–2964.