Upper Bounds for All and Max-Gain Policy Iteration Algorithms on Deterministic MDPs
Abstract
Policy iteration (PI) is a widely used family of algorithms to compute optimal policies for Markov decision problems (MDPs). Howard’s [Howard RA (1960) Dynamic Programming and Markov Processes (MIT Press, Cambridge, MA)] PI is one of the most commonly used algorithms from this family. Despite its popularity, theoretical analysis of the running-time complexity of Howard’s PI has remained elusive. For n-state, two-action MDPs, the best known lower and upper bounds are and iterations, respectively. Based on computational evidence for a combinatorial relaxation of this problem, Hansen [Hansen TD (2012) Worst-case analysis of strategy iteration and the simplex method. Unpublished PhD thesis, Aarhus University, Aarhus, Denmark] conjectured that the upper bound can be improved to , where is the golden ratio. We prove this conjecture for deterministic MDPs (DMDPs), albeit up to a poly(n) factor. More generally, we derive a nontrivial upper bound for DMDPs that applies to the entire family of PI algorithms. We also derive an improved bound that applies to all “max-gain” switching variants. These bounds hold both under discounted and average reward settings. Combined with a result of Melekopoglou and Condon [Melekopoglou M, Condon A (1994) On the complexity of the policy improvement algorithm for Markov decision processes. ORSA J. Comput. 6(2):188–192], our results imply that stochasticity makes two-action MDPs harder to solve for PI. Our analysis is based on certain graph-theoretic results, which may be of independent interest.

