Beyond Discounted Returns: Robust Markov Decision Processes with Average and Blackwell Optimality

Julien Grand-Clément
Corresponding Author
Julien Grand-Clément
[email protected]
https://orcid.org/0000-0002-2864-8779
Information Systems and Operations Management Department, Ecole des Hautes Etudes Commerciales (HEC) de Paris, 78350 Jouy-en-Josas, France
Search for more papers by this author
,
Marek Petrik
Marek Petrik
[email protected]
https://orcid.org/0000-0002-4568-7948
Department of Computer Science, University of New Hampshire, Durham, New Hampshire 03824
Search for more papers by this author
,
Nicolas Vieille
Nicolas Vieille
[email protected]
https://orcid.org/0009-0009-2031-9442
Economics and Decision Sciences Department, Ecole des Hautes Etudes Commerciales (HEC) de Paris, 78350 Jouy-en-Josas, France
Search for more papers by this author

Corresponding Author

Julien Grand-Clément

Information Systems and Operations Management Department, Ecole des Hautes Etudes Commerciales (HEC) de Paris, 78350 Jouy-en-Josas, France

Search for more papers by this author

Marek Petrik

[email protected]

https://orcid.org/0000-0002-4568-7948

Department of Computer Science, University of New Hampshire, Durham, New Hampshire 03824

Search for more papers by this author

Nicolas Vieille

[email protected]

https://orcid.org/0009-0009-2031-9442

Economics and Decision Sciences Department, Ecole des Hautes Etudes Commerciales (HEC) de Paris, 78350 Jouy-en-Josas, France

Search for more papers by this author

Published Online:4 Mar 2026https://doi.org/10.1287/opre.2023.0694

References

Akian M, Gaubert S (2003) Spectral theorem for convex monotone homogeneous maps, and ergodic control. Nonlinear Anal. (Oxf.): Theory Methods Appl. 52(2):637–679.Crossref, Google Scholar
Akian M, Gaubert S, Nussbaum R (2016) Uniqueness of the fixed point of nonexpansive semidifferentiable maps. Trans. Amer. Math. Soc. 368(2):1271–1320.Crossref, Google Scholar
Akian M, Gaubert S, Grand-Clément J, Guillaud J (2019) The operator approach to entropy games. Theory Comput. Syst. 63(5):1089–1130.Crossref, Google Scholar
Andersson D, Miltersen PB (2009) The complexity of solving stochastic games on graphs. Internat. Sympos. Algorithms Comput. (Springer, Berlin, Heidelberg), 112–121. Crossref, Google Scholar
Archibald T, McKinnon K, Thomas L (1995) On the generation of Markov decision processes. J. Oper. Res. Soc. 46(3):354–361.Crossref, Google Scholar
Bäuerle N, Rieder U (2011) Markov Decision Processes with Applications to Finance (Springer Science & Business Media, Berlin, Heidelberg).Crossref, Google Scholar
Baxter J, Bartlett PL (2001) Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15:319–350. Crossref, Google Scholar
Behzadian B, Petrik M, Ho CP (2021) Fast algorithms for l∞ constrained s-rectangular robust MDPs. Adv. Neural Inform. Processing Systems 34:25982–25992.Google Scholar
Bennett CC, Hauser K (2013) Artificial intelligence framework for simulating clinical decision-making: A Markov decision process approach. Artificial Intelligence Med. 57(1):9–19.Crossref, Google Scholar
Bewley T, Kohlberg E (1976) The asymptotic theory of stochastic games. Math. Oper. Res. 1(3):197–208.Link, Google Scholar
Bierth KJ (1987) An expected average reward criterion. Stochastic Processes Appl. 26:123–140.Crossref, Google Scholar
Blackwell D, Ferguson TS (1968) The big match. Ann. Math. Statist. 39(1):159–163.Crossref, Google Scholar
Bolte J, Gaubert S, Vigeral G (2015) Definable zero-sum stochastic games. Math. Oper. Res. 40(1):171–191.Link, Google Scholar
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI gym. Preprint, submitted June 5, https://arxiv.org/abs/1606.01540.Google Scholar
Chae J, Han S, Jung W, Cho M, Choi S, Sung Y (2022) Robust imitation learning against variations in environment dynamics. Internat. Conf. Machine Learn. (PMLR, New York), 2828–2852.Google Scholar
Condon A (1992) The complexity of stochastic games. Inform. Comput. 96(2):203–224.Crossref, Google Scholar
Cordwell S, Gonzalez Y, Tulabandhula T (2015) Markov Decision Process (MDP) toolbox for python. https://github.com/sawcordwell/pymdptoolbox.Google Scholar
Coste M (2000) An Introduction to o-Minimal Geometry (Istituti editoriali e poligrafici internazionali Pisa, Pisa).Google Scholar
Delage E, Mannor S (2010) Percentile optimization for Markov decision processes with parameter uncertainty. Oper. Res. 58(1):203–213.Link, Google Scholar
Deng Y, Bao F, Kong Y, Ren Z, Dai Q (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Networks Learn. Systems 28(3):653–664.Crossref, Google Scholar
Dewanto V, Dunn G, Eshragh A, Gallagher M, Roosta F (2020) Average-reward model-free reinforcement learning: A systematic review and literature mapping. Preprint, submitted October 18, https://arxiv.org/abs/2010.08920.Google Scholar
Everett H (1957) Recursive games. Contributions Theory Games 3(39):47–78.Google Scholar
Feinberg EA, Shwartz A (2012) Handbook of Markov Decision Processes: Methods and Applications, vol. 40 (Springer Science & Business Media, New York).Google Scholar
Gillette D (1957) Stochastic games with zero stop probabilities. Contributions Theory Games 3:179–187.Google Scholar
Givan R, Leach S, Dean T (1997) Bounded parameter Markov decision processes. Eur. Conf. Planning (Springer, Berlin, Heidelberg), 234–246.Google Scholar
Goh J, Bayati M, Zenios SA, Singh S, Moore D (2018) Data uncertainty in Markov chains: Application to cost-effectiveness analyses of medical innovations. Oper. Res. 66(3):697–715.Link, Google Scholar
Goyal V, Grand-Clément J (2023) Robust Markov decision processes: Beyond rectangularity. Math. Oper. Res. 48(1):203–226.Link, Google Scholar
Grand-Clément J, Kroer C (2021) First-order methods for Wasserstein distributionally robust MDP. Internat. Conf. Machine Learn (PMLR, New York), 2010–2019.Google Scholar
Grand-Clément J, Petrik M (2024) Reducing Blackwell and average optimality to discounted MDPS via the Blackwell discount factor. Adv. Neural Inform. Processing Systems 36.Google Scholar
Grand-Clément J, Petrik M (2025) On the convex formulations of robust Markov decision processes. Math. Oper. Res. 50(3):1681–1706.Link, Google Scholar
Grand-Clément J, Chan CW, Goyal V, Escobar G (2023) Robustness of proactive intensive care unit transfer policies. Oper. Res. 71(5):1653–1688.Link, Google Scholar
Hansen TD, Miltersen PB, Zwick U (2013) Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1):1–16.Crossref, Google Scholar
Ho CP, Petrik M, Wiesemann W (2021) Partial policy iteration for L1-robust Markov decision processes. J. Machine Learn. Res. 22(1):12612–12657.Google Scholar
Ho CP, Petrik M, Wiesemann W (2022) Robust phi-divergence MDPs. Adv. Neural Inform. Processing Systems 35:32680–32693.Crossref, Google Scholar
Iyengar G (2005) Robust dynamic programming. Math. Oper. Res. 30(2):257–280.Link, Google Scholar
Kohlberg E (1974) Repeated games with absorbing states. Ann. Statist. 724–738. Google Scholar
Kumar N, Derman E, Geist M, Levy KY, Mannor S (2023) Policy gradient for rectangular robust Markov decision processes. Adv. Neural Inform. Processing Systems 36:59477–59501.Google Scholar
Laraki R, Sorin S (2015) Advances in zero-sum dynamic games. Handbook of Game Theory with Economic Applications, vol. 4 (Elsevier, Amsterdam), 27–93.Google Scholar
Le Tallec Y (2007) Robust, risk-sensitive, and data-driven control of Markov decision processes. PhD thesis, Massachusetts Institute of Technology, Cambridge.Google Scholar
Leizarowitz A (2003) An algorithm to identify and compute average optimal policies in multichain Markov decision processes. Math. Oper. Res. 28(3):553–586.Link, Google Scholar
Li M, Sutter T, Kuhn D (2023) Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets. Preprint, submitted May 30, https://arxiv.org/abs/2305.19004.Google Scholar
Li Y, Zhao T, Lan G (2022) First-order policy optimization for robust Markov decision process. Preprint, submitted September 21, https://arxiv.org/abs/2209.10579.Google Scholar
Liggett TM, Lippman SA (1969) Stochastic games with perfect information and time average payoff. SIAM Rev. 11(4):604–607.Crossref, Google Scholar
Mannor S, Mebel O, Xu H (2016) Robust MDPs with k-rectangular uncertainty. Math. Oper. Res. 41(4):1484–1509.Link, Google Scholar
Mertens JF, Sorin S, Zamir S (2015) Repeated Games, vol. 55 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
Neyman A, Sorin S, Sorin S (2003) Stochastic Games and Applications, vol. 570 (Springer Science & Business Media, New York).Crossref, Google Scholar
Nilim A, Ghaoui LE (2005) Robust control of Markov decision processes with uncertain transition probabilities. Oper. Res. 53(5):780–798.Link, Google Scholar
Panaganti K, Kalathil D (2022) Sample complexity of robust reinforcement learning with a generative model. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 9582–9602.Google Scholar
Possingham H, Tuck G (1997) Application of stochastic dynamic programming to optimal fire management of a spatially structured threatened species. Proc. Internat. Congress Modelling Simulation, MODSIM, 813–817.Google Scholar
Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
Ramani S, Ghate A (2022) Robust Markov decision processes with data-driven, distance-based ambiguity sets. SIAM J. Optim. 32(2):989–1017.Crossref, Google Scholar
Ramani S, Ghate A (2024) A family of-rectangular robust MDPs: Relative conservativeness, asymptotic analyses, and finite-sample properties. SIAM J. Optim. 34(2):1540–1568.Crossref, Google Scholar
Renault J (2019) A tutorial on zero-sum stochastic games. Preprint, submitted May 16, https://arxiv.org/abs/1905.06577.Google Scholar
Satia J, Lave R (1973) Markov decision processes with uncertain transition probabilities. Oper. Res. 21(3):728–740.Link, Google Scholar
Shapley LS (1953) Stochastic games. Proc. Natl. Acad. Sci. USA 39(10):1095–1100.Crossref, Google Scholar
Sorin S (2002) A First Course on Zero-Sum Repeated Games, vol. 37 (Springer Science & Business Media, New York).Google Scholar
Steimle LN, Denton BT (2017) Markov decision processes for screening and treatment of chronic diseases. Markov Decision Processes in Practice (Springer, Cham, Switzerland), 189–222. Crossref, Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Tewari A, Bartlett PL (2007) Bounded parameter Markov decision processes with average reward criterion. Internat. Conf. Comput. Learn. Theory (Springer, Berlin, Heidelberg), 263–277.Google Scholar
Van Den Dries L (1998) O-minimal structures and real analytic geometry. Current Developments Math. 1998(1):105–152.Crossref, Google Scholar
Viano L, Huang Y-T, Kamalaruban P, Weller A, Cevher V (2021) Robust inverse reinforcement learning under transition dynamics mismatch. Adv. Neural Inform. Processing Systems 34:25917–25931.Google Scholar
Villani C (2021) Topics in Optimal Transportation, vol. 58 (American Mathematical Society, Providence, RI).Google Scholar
Wang Q, Ho CP, Petrik M (2023a) Policy gradient in robust MDPs with global convergence guarantee. ICML, Proceedings of Machine Learning Research, vol. 202 (PMLR, New York), 35763–35797.Google Scholar
Wang Y, Velasquez A, Atia G, Prater-Bennette A, Zou S (2023b) Robust average-reward Markov decision processes. Proc. AAAI Conf. Artificial Intelligence 37:15215–15223.Google Scholar
Wiesemann W, Kuhn D, Rustem B (2013) Robust Markov decision processes. Math. Oper. Res. 38(1):153–183.Link, Google Scholar
Xu H, Mannor S (2010) Distributionally robust Markov decision processes. Adv. Neural Inform. Processing Systems 23.Google Scholar
Zhang Y, Steimle L, Denton B (2017) Robust Markov decision processes for medical treatment decisions. Optim. Online.Google Scholar
Zwick U, Paterson M (1996) The complexity of mean payoff games on graphs. Theoret. Comput. Sci. 158(1–2):343–359.Crossref, Google Scholar

Articles In Advance

Article Information

Supplemental Material

Metrics

Information

Received:January 02, 2024
Accepted:January 13, 2026
Published Online:March 04, 2026

Cite as

Julien Grand-Clément, Marek Petrik, Nicolas Vieille (2026) Beyond Discounted Returns: Robust Markov Decision Processes with Average and Blackwell Optimality. Operations Research 0(0).

https://doi.org/10.1287/opre.2023.0694

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Beyond Discounted Returns: Robust Markov Decision Processes with Average and Blackwell Optimality

References

Articles In Advance

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News