Polynomial-Time Computation of Strong and n-Present-Value Optimal Policies in Markov Decision Chains

Michael O’Sullivan
Corresponding Author
Michael O’Sullivan
[email protected]
Department of Engineering Science, University of Auckland, Auckland 1142, New Zealand
Search for more papers by this author
,
Arthur F. Veinott, Jr.
Arthur F. Veinott, Jr.
Formerly at the Department of Management Science and Engineering, Stanford University, Stanford California
Search for more papers by this author

Michael O’Sullivan

Corresponding Author

Michael O’Sullivan

[email protected]

Department of Engineering Science, University of Auckland, Auckland 1142, New Zealand

Search for more papers by this author

Arthur F. Veinott, Jr. (deceased)

Arthur F. Veinott, Jr.

Formerly at the Department of Management Science and Engineering, Stanford University, Stanford California

Search for more papers by this author

Published Online:14 Dec 2016https://doi.org/10.1287/moor.2016.0812

Abstract

This paper studies the problem of finding a stationary strong present-value optimal and, more generally, an n-present-value optimal, policy for an infinite-horizon stationary finite-state-action substochastic Markov decision chain. A policy is strong present-value optimal if it has maximum present value for all small positive interest rates ρ. A policy is n-present-value optimal if its present value falls below the maximum present value for all small positive ρ by O(ρⁿ⁺¹). The importance of stationary n-present-value optimal policies arises from the following known facts. The set of such policies diminishes with n and, for n ≥ S, is precisely the set of stationary strong present-value optimal policies. For 0 ≤ n < S, stationary n-present-value optimal policies are nearly strong present-value optimal and are of independent interest. The best algorithms for finding stationary strong present-value optimal policies find stationary n-present-value policies for n = −1,…,S in that order. This paper decomposes the problem of finding a stationary n-present-value optimal policy given a stationary (n − 1)-present-value optimal policy into a sequence of three subproblems, each entailing either maximizing transient value or reward rate. It is known that both subproblem types can be represented as linear programs and solved in polynomial time, e.g., by interior-point and ellipsoid methods. This paper also establishes the following results. The size of the input to each subproblem is polynomial in the size of the original problem data. A stationary strong present-value (respectively, n-present-value) optimal policy can be found in polynomial time. For the case of unique-transition systems, i.e., each action in a state sends the system to at most one state, a stationary strong present-value (respectively, n-present-value) optimal policy can be found in strongly polynomial time using combinatorial methods on the subproblems. The last case includes standard deterministic dynamic programs.

cover image Mathematics of Operations Research

Volume 42, Issue 3

August 2017

Pages 577-896

Article Information

Metrics

Information

Received:January 22, 2015
Accepted:June 07, 2016
Published Online:December 14, 2016

Cite as

Michael O’Sullivan, Arthur F. Veinott, Jr. (2016) Polynomial-Time Computation of Strong and n-Present-Value Optimal Policies in Markov Decision Chains. Mathematics of Operations Research 42(3):577-598.

https://doi.org/10.1287/moor.2016.0812

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Polynomial-Time Computation of Strong and n-Present-Value Optimal Policies in Markov Decision Chains

Abstract

Volume 42, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News