Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Dimitri P. Bertsekas
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Huizhen Yu
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Published Online: 13 Jan 2012
https://doi.org/10.1287/moor.1110.0532

Abstract

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference from standard policy iteration is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, obtained with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively addresses the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
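The following is a minimal, synchronous sketch of the kind of iteration the abstract describes; it is not taken from the paper itself. It assumes that the stopping-problem evaluation for a policy mu and a stopping-cost vector J updates Q-factors as Q(i,u) = sum_j p_ij(u) [ g(i,u,j) + alpha * min{ J(j), Q(j, mu(j)) } ], followed by the improvement step J(i) = min_u Q(i,u). The function names, the array layout `P[u, i, j]`, and the iteration counts are illustrative assumptions.

```python
import numpy as np


def value_iteration(P, g, alpha, iters=2000):
    """Plain value iteration, used here only as a reference solution.

    P[u, i, j]: transition probability from state i to j under control u.
    g[u, i, j]: cost of that transition. alpha: discount factor in (0, 1).
    """
    n = P.shape[1]
    J = np.zeros(n)
    for _ in range(iters):
        Q = np.einsum("uij,uij->ui", P, g + alpha * J[None, None, :])
        J = Q.min(axis=0)
    return J


def enhanced_policy_iteration(P, g, alpha, outer=300, inner=3):
    """Policy iteration-like sketch with a stopping-problem evaluation phase.

    Assumed form: each evaluation phase runs `inner` value iterations on an
    optimal stopping problem whose stopping cost is the current J and whose
    continuation cost follows the current policy mu.
    """
    m, n, _ = P.shape
    Q = np.zeros((m, n))  # Q[u, i]: Q-factor of control u at state i
    states = np.arange(n)
    for _ in range(outer):
        # Policy improvement: greedy costs and policy from the current Q.
        J = Q.min(axis=0)
        mu = Q.argmin(axis=0)
        # Inexact policy evaluation: a few value iterations on the
        # stopping problem (stop at cost J(j), or continue with mu(j)).
        for _ in range(inner):
            cont = Q[mu, states]           # Q(j, mu(j)) for every state j
            target = np.minimum(J, cont)   # min{J(j), Q(j, mu(j))}
            Q = np.einsum("uij,uij->ui", P, g + alpha * target[None, None, :])
    return Q.min(axis=0), Q.argmin(axis=0)
```

On a small randomly generated problem, the costs returned by `enhanced_policy_iteration` can be compared with those of `value_iteration` as a sanity check; with `inner=1` the sketch behaves much like value iteration on Q-factors, while larger `inner` moves it toward policy iteration.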

Mathematics of Operations Research

Volume 37, Issue 1

February 2012

Pages 1-200

Article Information

Received: October 14, 2010
Published Online: January 13, 2012

Cite as

Dimitri P. Bertsekas, Huizhen Yu (2012) Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming. Mathematics of Operations Research 37(1):66-94.

https://doi.org/10.1287/moor.1110.0532
