Beyond Discounted Returns: Robust Markov Decision Processes with Average and Blackwell Optimality

Published Online:https://doi.org/10.1287/opre.2023.0694

References

  • Akian M, Gaubert S (2003) Spectral theorem for convex monotone homogeneous maps, and ergodic control. Nonlinear Anal. (Oxf.): Theory Methods Appl. 52(2):637–679.CrossrefGoogle Scholar
  • Akian M, Gaubert S, Nussbaum R (2016) Uniqueness of the fixed point of nonexpansive semidifferentiable maps. Trans. Amer. Math. Soc. 368(2):1271–1320.CrossrefGoogle Scholar
  • Akian M, Gaubert S, Grand-Clément J, Guillaud J (2019) The operator approach to entropy games. Theory Comput. Syst. 63(5):1089–1130.CrossrefGoogle Scholar
  • Andersson D, Miltersen PB (2009) The complexity of solving stochastic games on graphs. Internat. Sympos. Algorithms Comput. (Springer, Berlin, Heidelberg), 112–121. CrossrefGoogle Scholar
  • Archibald T, McKinnon K, Thomas L (1995) On the generation of Markov decision processes. J. Oper. Res. Soc. 46(3):354–361.CrossrefGoogle Scholar
  • Bäuerle N, Rieder U (2011) Markov Decision Processes with Applications to Finance (Springer Science & Business Media, Berlin, Heidelberg).CrossrefGoogle Scholar
  • Baxter J, Bartlett PL (2001) Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15:319–350. CrossrefGoogle Scholar
  • Behzadian B, Petrik M, Ho CP (2021) Fast algorithms for l∞ constrained s-rectangular robust MDPs. Adv. Neural Inform. Processing Systems 34:25982–25992.Google Scholar
  • Bennett CC, Hauser K (2013) Artificial intelligence framework for simulating clinical decision-making: A Markov decision process approach. Artificial Intelligence Med. 57(1):9–19.CrossrefGoogle Scholar
  • Bewley T, Kohlberg E (1976) The asymptotic theory of stochastic games. Math. Oper. Res. 1(3):197–208.LinkGoogle Scholar
  • Bierth KJ (1987) An expected average reward criterion. Stochastic Processes Appl. 26:123–140.CrossrefGoogle Scholar
  • Blackwell D, Ferguson TS (1968) The big match. Ann. Math. Statist. 39(1):159–163.CrossrefGoogle Scholar
  • Bolte J, Gaubert S, Vigeral G (2015) Definable zero-sum stochastic games. Math. Oper. Res. 40(1):171–191.LinkGoogle Scholar
  • Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI gym. Preprint, submitted June 5, https://arxiv.org/abs/1606.01540.Google Scholar
  • Chae J, Han S, Jung W, Cho M, Choi S, Sung Y (2022) Robust imitation learning against variations in environment dynamics. Internat. Conf. Machine Learn. (PMLR, New York), 2828–2852.Google Scholar
  • Condon A (1992) The complexity of stochastic games. Inform. Comput. 96(2):203–224.CrossrefGoogle Scholar
  • Cordwell S, Gonzalez Y, Tulabandhula T (2015) Markov Decision Process (MDP) toolbox for python. https://github.com/sawcordwell/pymdptoolbox.Google Scholar
  • Coste M (2000) An Introduction to o-Minimal Geometry (Istituti editoriali e poligrafici internazionali Pisa, Pisa).Google Scholar
  • Delage E, Mannor S (2010) Percentile optimization for Markov decision processes with parameter uncertainty. Oper. Res. 58(1):203–213.LinkGoogle Scholar
  • Deng Y, Bao F, Kong Y, Ren Z, Dai Q (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Networks Learn. Systems 28(3):653–664.CrossrefGoogle Scholar
  • Dewanto V, Dunn G, Eshragh A, Gallagher M, Roosta F (2020) Average-reward model-free reinforcement learning: A systematic review and literature mapping. Preprint, submitted October 18, https://arxiv.org/abs/2010.08920.Google Scholar
  • Everett H (1957) Recursive games. Contributions Theory Games 3(39):47–78.Google Scholar
  • Feinberg EA, Shwartz A (2012) Handbook of Markov Decision Processes: Methods and Applications, vol. 40 (Springer Science & Business Media, New York).Google Scholar
  • Gillette D (1957) Stochastic games with zero stop probabilities. Contributions Theory Games 3:179–187.Google Scholar
  • Givan R, Leach S, Dean T (1997) Bounded parameter Markov decision processes. Eur. Conf. Planning (Springer, Berlin, Heidelberg), 234–246.Google Scholar
  • Goh J, Bayati M, Zenios SA, Singh S, Moore D (2018) Data uncertainty in Markov chains: Application to cost-effectiveness analyses of medical innovations. Oper. Res. 66(3):697–715.LinkGoogle Scholar
  • Goyal V, Grand-Clément J (2023) Robust Markov decision processes: Beyond rectangularity. Math. Oper. Res. 48(1):203–226.LinkGoogle Scholar
  • Grand-Clément J, Kroer C (2021) First-order methods for Wasserstein distributionally robust MDP. Internat. Conf. Machine Learn (PMLR, New York), 2010–2019.Google Scholar
  • Grand-Clément J, Petrik M (2024) Reducing Blackwell and average optimality to discounted MDPS via the Blackwell discount factor. Adv. Neural Inform. Processing Systems 36.Google Scholar
  • Grand-Clément J, Petrik M (2025) On the convex formulations of robust Markov decision processes. Math. Oper. Res. 50(3):1681–1706.LinkGoogle Scholar
  • Grand-Clément J, Chan CW, Goyal V, Escobar G (2023) Robustness of proactive intensive care unit transfer policies. Oper. Res. 71(5):1653–1688.LinkGoogle Scholar
  • Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.CrossrefGoogle Scholar
  • Ho CP, Petrik M, Wiesemann W (2021) Partial policy iteration for L1-robust Markov decision processes. J. Machine Learn. Res. 22(1):12612–12657.Google Scholar
  • Ho CP, Petrik M, Wiesemann W (2022) Robust phi-divergence MDPs. Adv. Neural Inform. Processing Systems 35:32680–32693.CrossrefGoogle Scholar
  • Iyengar G (2005) Robust dynamic programming. Math. Oper. Res. 30(2):257–280.LinkGoogle Scholar
  • Kohlberg E (1974) Repeated games with absorbing states. Ann. Statist. 724–738. Google Scholar
  • Kumar N, Derman E, Geist M, Levy KY, Mannor S (2023) Policy gradient for rectangular robust Markov decision processes. Adv. Neural Inform. Processing Systems 36:59477–59501.Google Scholar
  • Laraki R, Sorin S (2015) Advances in zero-sum dynamic games. Handbook of Game Theory with Economic Applications, vol. 4 (Elsevier, Amsterdam), 27–93.Google Scholar
  • Le Tallec Y (2007) Robust, risk-sensitive, and data-driven control of Markov decision processes. PhD thesis, Massachusetts Institute of Technology, Cambridge.Google Scholar
  • Leizarowitz A (2003) An algorithm to identify and compute average optimal policies in multichain Markov decision processes. Math. Oper. Res. 28(3):553–586.LinkGoogle Scholar
  • Li M, Sutter T, Kuhn D (2023) Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets. Preprint, submitted May 30, https://arxiv.org/abs/2305.19004.Google Scholar
  • Li Y, Zhao T, Lan G (2022) First-order policy optimization for robust Markov decision process. Preprint, submitted September 21, https://arxiv.org/abs/2209.10579.Google Scholar
  • Liggett TM, Lippman SA (1969) Stochastic games with perfect information and time average payoff. SIAM Rev. 11(4):604–607.CrossrefGoogle Scholar
  • Mannor S, Mebel O, Xu H (2016) Robust MDPs with k-rectangular uncertainty. Math. Oper. Res. 41(4):1484–1509.LinkGoogle Scholar
  • Mertens JF, Sorin S, Zamir S (2015) Repeated Games, vol. 55 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
  • Neyman A, Sorin S, Sorin S (2003) Stochastic Games and Applications, vol. 570 (Springer Science & Business Media, New York).CrossrefGoogle Scholar
  • Nilim A, Ghaoui LE (2005) Robust control of Markov decision processes with uncertain transition probabilities. Oper. Res. 53(5):780–798.LinkGoogle Scholar
  • Panaganti K, Kalathil D (2022) Sample complexity of robust reinforcement learning with a generative model. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 9582–9602.Google Scholar
  • Possingham H, Tuck G (1997) Application of stochastic dynamic programming to optimal fire management of a spatially structured threatened species. Proc. Internat. Congress Modelling Simulation, MODSIM, 813–817.Google Scholar
  • Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • Ramani S, Ghate A (2022) Robust Markov decision processes with data-driven, distance-based ambiguity sets. SIAM J. Optim. 32(2):989–1017.CrossrefGoogle Scholar
  • Ramani S, Ghate A (2024) A family of-rectangular robust MDPs: Relative conservativeness, asymptotic analyses, and finite-sample properties. SIAM J. Optim. 34(2):1540–1568.CrossrefGoogle Scholar
  • Renault J (2019) A tutorial on zero-sum stochastic games. Preprint, submitted May 16, https://arxiv.org/abs/1905.06577.Google Scholar
  • Satia J, Lave R (1973) Markov decision processes with uncertain transition probabilities. Oper. Res. 21(3):728–740.LinkGoogle Scholar
  • Shapley LS (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.CrossrefGoogle Scholar
  • Sorin S (2002) A First Course on Zero-Sum Repeated Games, vol. 37 (Springer Science & Business Media, New York).Google Scholar
  • Steimle LN, Denton BT (2017) Markov decision processes for screening and treatment of chronic diseases. Markov Decision Processes in Practice (Springer, Cham, Switzerland), 189–222. CrossrefGoogle Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Tewari A, Bartlett PL (2007) Bounded parameter Markov decision processes with average reward criterion. Internat. Conf. Comput. Learn. Theory (Springer, Berlin, Heidelberg), 263–277.Google Scholar
  • Van Den Dries L (1998) O-minimal structures and real analytic geometry. Current Developments Math. 1998(1):105–152.CrossrefGoogle Scholar
  • Viano L, Huang Y-T, Kamalaruban P, Weller A, Cevher V (2021) Robust inverse reinforcement learning under transition dynamics mismatch. Adv. Neural Inform. Processing Systems 34:25917–25931.Google Scholar
  • Villani C (2021) Topics in Optimal Transportation, vol. 58 (American Mathematical Society, Providence, RI).Google Scholar
  • Wang Q, Ho CP, Petrik M (2023a) Policy gradient in robust MDPs with global convergence guarantee. ICML, Proceedings of Machine Learning Research, vol. 202 (PMLR, New York), 35763–35797.Google Scholar
  • Wang Y, Velasquez A, Atia G, Prater-Bennette A, Zou S (2023b) Robust average-reward Markov decision processes. Proc. AAAI Conf. Artificial Intelligence 37:15215–15223.Google Scholar
  • Wiesemann W, Kuhn D, Rustem B (2013) Robust Markov decision processes. Math. Oper. Res. 38(1):153–183.LinkGoogle Scholar
  • Xu H, Mannor S (2010) Distributionally robust Markov decision processes. Adv. Neural Inform. Processing Systems 23.Google Scholar
  • Zhang Y, Steimle L, Denton B (2017) Robust Markov decision processes for medical treatment decisions. Optim. Online.Google Scholar
  • Zwick U, Paterson M (1996) The complexity of mean payoff games on graphs. Theoret. Comput. Sci. 158(1–2):343–359.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.