Upper Bounds for All and Max-Gain Policy Iteration Algorithms on Deterministic MDPs

Ritesh Goenka (Corresponding Author)
https://orcid.org/0000-0002-5004-8112
Mathematical Institute, University of Oxford, Oxford OX2 6GG, United Kingdom

Eashan Gupta
https://orcid.org/0000-0002-9266-8731
Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois 61820

Sushil Khyalia
https://orcid.org/0009-0007-5688-3315
Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

Shivaram Kalyanakrishnan
https://orcid.org/0009-0006-7707-6056
Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India

Published Online: 10 Apr 2025
https://doi.org/10.1287/moor.2023.0317

Abstract

Policy iteration (PI) is a widely used family of algorithms to compute optimal policies for Markov decision problems (MDPs). Howard’s [Howard RA (1960) Dynamic Programming and Markov Processes (MIT Press, Cambridge, MA)] PI is one of the most commonly used algorithms from this family. Despite its popularity, theoretical analysis of the running-time complexity of Howard’s PI has remained elusive. For $n$-state, two-action MDPs, the best known lower and upper bounds are $\Omega(n)$ and $O(2^{n}/n)$ iterations, respectively. Based on computational evidence for a combinatorial relaxation of this problem, Hansen [Hansen TD (2012) Worst-case analysis of strategy iteration and the simplex method. Unpublished PhD thesis, Aarhus University, Aarhus, Denmark] conjectured that the upper bound can be improved to $O(\phi^{n})$, where $\phi = (1 + \sqrt{5})/2$ is the golden ratio. We prove this conjecture for deterministic MDPs (DMDPs), albeit up to a $\mathrm{poly}(n)$ factor. More generally, we derive a nontrivial upper bound for DMDPs that applies to the entire family of PI algorithms. We also derive an improved bound that applies to all “max-gain” switching variants. These bounds hold both under discounted and average reward settings. Combined with a result of Melekopoglou and Condon [Melekopoglou M, Condon A (1994) On the complexity of the policy improvement algorithm for Markov decision processes. ORSA J. Comput. 6(2):188–192], our results imply that stochasticity makes two-action MDPs harder to solve for PI. Our analysis is based on certain graph-theoretic results, which may be of independent interest.
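
For readers unfamiliar with the algorithm being analyzed, the following is a minimal sketch of Howard's PI on a discounted DMDP. It is an illustration only and is not taken from the paper: the function names `evaluate` and `howard_pi`, the list-based encoding (`nxt[s][a]` is the successor of state `s` under action `a`, `rew[s][a]` is the reward), and the two-state example are all assumptions made here. Exact rational arithmetic is used so that the improvement test needs no tolerance.

```python
from fractions import Fraction


def evaluate(policy, nxt, rew, gamma):
    """Exact discounted values of a policy on a deterministic MDP.

    Following a policy in a DMDP traces a path that ends in a cycle, so
    each value is a geometric sum rather than a linear-system solve.
    """
    n = len(policy)
    values = [None] * n
    for start in range(n):
        path, pos = [], {}
        s = start
        # Walk until reaching an already-valued state or closing a cycle.
        while values[s] is None and s not in pos:
            pos[s] = len(path)
            path.append(s)
            s = nxt[s][policy[s]]
        if values[s] is None:
            # s closes a new cycle; path[pos[s]:] lists it starting at s.
            cycle = path[pos[s]:]
            total = sum(gamma ** i * rew[c][policy[c]]
                        for i, c in enumerate(cycle))
            values[s] = total / (1 - gamma ** len(cycle))
        # Back up values along the path, last visited state first.
        for t in reversed(path):
            if values[t] is None:
                values[t] = rew[t][policy[t]] + gamma * values[nxt[t][policy[t]]]
    return values


def howard_pi(nxt, rew, gamma, policy=None):
    """Howard's PI: switch every improvable state in each iteration."""
    n = len(nxt)
    policy = list(policy) if policy is not None else [0] * n
    iterations = 0
    while True:
        v = evaluate(policy, nxt, rew, gamma)
        improved = False
        for s in range(n):
            q = [rew[s][a] + gamma * v[nxt[s][a]] for a in range(len(nxt[s]))]
            best = max(range(len(q)), key=q.__getitem__)
            if q[best] > v[s]:  # strictly better action exists: switch
                policy[s] = best
                improved = True
        if not improved:
            return policy, v, iterations
        iterations += 1


# Hypothetical two-state, two-action DMDP with discount factor 1/2.
example_nxt = [[0, 1], [1, 0]]
example_rew = [[0, 1], [2, 0]]
example_policy, example_values, example_iters = howard_pi(
    example_nxt, example_rew, Fraction(1, 2))
```

With two actions per state, switching to the best action is the same as switching to the only alternative, so the variants of PI studied in the abstract differ solely in which subset of improvable states they switch. The sketch switches all of them, which is Howard's rule; the iteration counter counts policy changes, the quantity that the stated bounds control.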

Mathematics of Operations Research

Volume 51, Issue 1

February 2026

Pages iv-viii, 1-851

Article Information

Received: October 16, 2023
Accepted: February 20, 2025
Published Online: April 10, 2025

Cite as

Ritesh Goenka, Eashan Gupta, Sushil Khyalia, Shivaram Kalyanakrishnan (2025) Upper Bounds for All and Max-Gain Policy Iteration Algorithms on Deterministic MDPs. Mathematics of Operations Research 51(1):806-828.

https://doi.org/10.1287/moor.2023.0317

Acknowledgments

The authors thank Pratyush Agarwal and Mulinti Shaik Wajid for carefully reading this manuscript and providing helpful suggestions. The authors also thank the anonymous reviewers and editors for their valuable comments, which helped improve the quality of this paper. Eashan Gupta and Sushil Khyalia contributed equally.
