A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies

Huizhen Yu
Huizhen Yu
[email protected]
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2R3
Search for more papers by this author
,
Dimitri P. Bertsekas
Dimitri P. Bertsekas
[email protected]
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author

Huizhen Yu

[email protected]

Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2R3

Search for more papers by this author

Dimitri P. Bertsekas

[email protected]

Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Published Online:1 Apr 2015https://doi.org/10.1287/moor.2014.0704

References

Altman E (1999) Constrained Markov Decision Processes (Chapman & Hall/CRC, Boca Raton, FL).Google Scholar
Arapostathis A, Borkar VS, Fernández-Gaucherand E, Ghosh MK, Marcus SI (1993) Discrete-time controlled Markov processes with average cost criterion: A survey. SIAM J. Control Optim. 31:282–344.Crossref, Google Scholar
Bertsekas DP (1972) Infinite time reachability of state space regions by using feedback control. IEEE Trans. Automatic Control AC-17:604–613.Crossref, Google Scholar
Bertsekas DP (1977) Monotone mappings with application in dynamic programming. SIAM J. Control Optim. 15:438–464.Crossref, Google Scholar
Bertsekas DP (2012) Dynamic Programming and Optimal Control, 4th ed., Vol. II (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP (2013) Abstract Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP, Shreve SE (1978) Stochastic Optimal Control: The Discrete Time Case (Academic Press, New York).Google Scholar
Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP, Yu H (2010) Distributed asynchronous policy iteration in dynamic programming. Proc. 48th Allerton Conf. Comm., Control Comput. (IEEE, Piscataway, NJ), 1368–1375.Crossref, Google Scholar
Bertsekas DP, Yu H (2012) Q-learning and enhanced policy iteration in discounted dynamic programming. Math. Oper. Res. 37(1):66–94.Link, Google Scholar
Blackwell D (1964) Memoryless strategies in finite stage dynamic programming. Ann. Math. Statist. 35:863–865.Crossref, Google Scholar
Blackwell D (1965a) Discounted dynamic programming. Ann. Math. Statist. 36:226–235.Crossref, Google Scholar
Blackwell D (1965b) Positive dynamic programming. Proc. 5th Berkeley Sympos. Math. Satist. Probab. (University of California Press, Berkeley, CA), 415–418.Google Scholar
Blackwell D (1968) A Borel set not containing a graph. Ann. Math. Statist. 39:1345–1347.Crossref, Google Scholar
Blackwell D (1978) Borel-programmable functions. Ann. Probab. 6:321–324.Crossref, Google Scholar
Blackwell D, Ryll-Nardzewski C (1963) Non-existence of everywhere proper conditional distributions. Ann. Math. Statist. 34:223–225.Crossref, Google Scholar
Blackwell D, Freedman D, Orkin M (1974) The optimal reward operator in dynamic programming. Ann. Probab. 2:926–941.Crossref, Google Scholar
Chen RR, Meyn S (1999) Value iteration and optimization of multiclass queueing networks. Queueing Systems 32:65–97.Crossref, Google Scholar
Dudley RM (2002) Real Analysis and Probability (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Dynkin EB, Yushkevich AA (1979) Controlled Markov Processes (Springer, New York).Crossref, Google Scholar
Feinberg EA (2002) Total reward criteria. Feinberg EA, Shwartz A, eds. Handbook of Markov Decision Processes (Springer, New York), 155–189.Crossref, Google Scholar
Feinberg EA, Shwartz A, eds. (2002) Handbook of Markov Decision Processes (Springer, New York).Crossref, Google Scholar
Feinberg EA, Kasyanov PO, Zadoianchuk NV (2012) Average cost Markov decision processes with weakly continuous transition probabilities. Math. Oper. Res. 37(4):591–607.Link, Google Scholar
Freedman D (1974) The optimal reward operator in special classes of dynamic programming problems. Ann. Probability 2:942–949.Crossref, Google Scholar
Furukawa N (1972) Markovian decision processes with compact action spaces. Ann. Math. Statist. 43:1612–1622.Crossref, Google Scholar
Hartley R (1980) A simple proof of Whittle’s bridging condition in dynamic programming. J. Appl. Prob. 17:1114–1116.Crossref, Google Scholar
Hernández-Lerma O, Lasserre JB (1996) Discrete-Time Markov Control Processes: Basic Optimality Criteria (Springer, New York).Crossref, Google Scholar
Hernández-Lerma O, Lasserre JB (1999) Further Topics on Discrete-Time Markov Control Processes (Springer, New York).Crossref, Google Scholar
Hinderer K (1970) Foundations of Non-Stationary Dynamic Programming with Discrete Time Parameter (Springer, New York).Crossref, Google Scholar
Kreps DM, Porteus EL (1977) On the optimality of structured policies in countable stage decision processes. II: Positive and negative problems. SIAM J. Appl. Math. 32:457–466.Crossref, Google Scholar
Kuratowski K (1966) Topology I (Academic Press, New York).Google Scholar
Maitra A (1968) Discounted dynamic programming on compact metric spaces. Sankhyā: Indian J. Statist., Ser. A 30:211–216.Google Scholar
Maitra A, Sudderth W (1992) The optimal reward operator in negative dynamic programming. Math. Oper. Res. 17(4):921–931.Link, Google Scholar
Meyn S (2008) Control Techniques for Complex Networks (Cambridge University Press, Cambridge, UK).Google Scholar
Meyn S, Tweedie RL (2009) Markov Chains and Stochastic Stability, 2nd ed. (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Miller BL, Veinott AF (1969) Discrete dynamic programming with a small interest rate. Ann. Math. Statist. 40:366–370.Crossref, Google Scholar
Neveu J (1975) Discrete-Parameter Martingales (North-Holland, Amsterdam).Google Scholar
Parthasarathy KR (1967) Probability Measures on Metric Spaces (Academic Press, New York).Crossref, Google Scholar
Puterman ML (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Crossref, Google Scholar
Rudin W (1976) Principles of Mathematical Analysis, 3rd ed. (McGraw-Hill, New York).Google Scholar
Schäl M (1975) Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. Wahrscheinlichkeitstheorie verw. Gebiete 32:179–196.Crossref, Google Scholar
Shreve SE (1978) Probability measures and the C-set of Selivanovskij. Pacific J. Math. 79:189–196.Crossref, Google Scholar
Shreve SE (1979) Resolution of measurability problems in discrete-time stochastic control. Stochastic Control Theory and Stochastic Differential Systems (Springer, Berlin), 580–587.Crossref, Google Scholar
Shreve SE (1981) Borel-approachable functions. Fundamenta Mathematicae 112:17–24.Google Scholar
Shreve SE, Bertsekas DP (1978) Alternative theoretical frameworks for finite horizon discrete-time stochastic optimal control. SIAM J. Control Optim. 16:953–977.Crossref, Google Scholar
Shreve SE, Bertsekas DP (1979) Universally measurable policies in dynamic programming. Math. Oper. Res. 4(1):15–30.Link, Google Scholar
Srivastava SM (1998) A Course on Borel Sets (Springer, New York).Crossref, Google Scholar
Strauch RE (1966) Negative dynamic programming. Ann. Math. Statist. 37:871–890.Crossref, Google Scholar
Sutton RS, Barto AG (1998) Reinforcement Learning (MIT Press, Cambridge).Google Scholar
van der Wal J (1981) Stochastic Dynamic Programming (Mathematical Centre, Amsterdam).Google Scholar
Veinott AF (1966) On finding optimal policies in discrete dynamic programming with no discounting. Ann. Math. Statist. 37: 1284–1294.Crossref, Google Scholar
Veinott AF (1969) On discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Statist. 40:1635–1660.Crossref, Google Scholar
Whittle P (1979) A simple condition for regularity in negative programming. J. Appl. Prob. 16:305–318.Crossref, Google Scholar
Whittle P (1980) Stability and characterisation conditions in negative programming. J. Appl. Prob. 17:635–645.Crossref, Google Scholar
Yu H (2014) On convergence of value iteration for a class of total cost Markov decision processes. Unpublished report, http://arxiv.org/abs/1411.1459.Google Scholar
Yu H, Bertsekas DP (2013) Q-learning and policy iteration algorithms for stochastic shortest path problems. Ann. Oper. Res. 208: 95–132.Crossref, Google Scholar

cover image Mathematics of Operations Research

Volume 40, Issue 4

November 2015

Pages 797-1088

Article Information

Metrics

Information

Received:August 28, 2013
Published Online:April 01, 2015

Cite as

Huizhen Yu, Dimitri P. Bertsekas (2015) A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies. Mathematics of Operations Research 40(4):926-968.

https://doi.org/10.1287/moor.2014.0704

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies

References

Volume 40, Issue 4

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News