Convergence of Finite Memory Q Learning for POMDPs and Near Optimality of Learned Policies Under Filter Stability

Ali Devran Kara
Corresponding Author
Ali Devran Kara
[email protected]
https://orcid.org/0000-0001-8119-6620
Department of Mathematics, University of Michigan, Ann Arbor, Michigan 48103;
Search for more papers by this author
,
Serdar Yüksel
Serdar Yüksel
[email protected]
Department of Mathematics and Statistics, Queen’s University, Kingston, Ontario K7L 3N6, Canada
Search for more papers by this author

Ali Devran Kara

Corresponding Author

Ali Devran Kara

[email protected]

https://orcid.org/0000-0001-8119-6620

Department of Mathematics, University of Michigan, Ann Arbor, Michigan 48103;

Search for more papers by this author

Serdar Yüksel

[email protected]

Department of Mathematics and Statistics, Queen’s University, Kingston, Ontario K7L 3N6, Canada

Search for more papers by this author

Published Online:21 Nov 2022https://doi.org/10.1287/moor.2022.1331

References

[1] Billingsley P (1999) Convergence of Probability Measures, 2nd ed. (Wiley, New York).Crossref, Google Scholar
[2] Dobrushin R (1956) Central limit theorem for nonstationary Markov chains. I. Theory Probab. Appl. 1(1):65–80.Crossref, Google Scholar
[3] Even-Dar E, Mansour Y, Bartlett P (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(1):1–25.Google Scholar
[4] Feinberg E (1982) Controlled Markov processes with arbitrary numerical criteria. Theory Probab. Appl. 27(3):486–503.Crossref, Google Scholar
[5] Feinberg E, Kasyanov P, Zadioanchuk N (2012) Average cost Markov decision processes with weakly continuous transition probabilities. Math. Oper. Res. 37(4):591–607.Link, Google Scholar
[6] Feinberg E, Kasyanov P, Zgurovsky M (2016) Partially observable total-cost Markov decision process with weakly continuous transition probabilities. Math. Oper. Res. 41(2):656–681.Link, Google Scholar
[7] Golowich N, Moitra A, Rohatgi D (2022) Planning in observable POMDPs in quasipolynomial time. Preprint, submitted June 7, https://arxiv.org/abs/2201.04735.Google Scholar
[8] Hansen EA (2013) Solving POMDPs by searching in policy space. Preprint, submitted January 30, https://arxiv.org/abs/1301.7380.Google Scholar
[9] Hernández-Lerma O (1989) Adaptive Markov Control Processes (Springer-Verlag, New York).Crossref, Google Scholar
[10] Hernández-Lerma O, Lasserre JB (1996) Discrete-Time Markov Control Processes: Basic Optimality Criteria (Springer, New York).Crossref, Google Scholar
[11] Jaakkola T, Jordan M, Singh S (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6(6):1185–1201.Crossref, Google Scholar
[12] Jaakkola T, Singh S, Jordan M (1995) Reinforcement learning algorithm for partially observable Markov decision problems. Adv. Neural Inform. Processing Systems 7:345–352.Google Scholar
[13] Kac M (1947) On the notion of recurrence in discrete stochastic processes. Bull. Amer. Math. Soc. 53(10):1002–1010.Crossref, Google Scholar
[14] Kara A, Yuksel S (2022) Near optimality of finite memory feedback policies in partially observed Markov decision processes. J. Machine Learn. Res. 23(1):1–46.Google Scholar
[15] Kara A, Saldi N, Yüksel S (2019) Weak Feller property of non-linear filters. Systems Control Lett. 134:104512.Crossref, Google Scholar
[16] Krishnamurthy V (2016) Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[17] Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Preprint, submitted June 4, https://arxiv.org/abs/2006.03041.Google Scholar
[18] Lin LJ, Mitchell TM (1992) Memory approaches to reinforcement learning in non-Markovian domains. Report, Carnegie Mellon University, Pittsburgh.Google Scholar
[19] Lovejoy W (1991) A survey of algorithmic methods for partially observed Markov decision processes. Ann. Oper. Res. 28:47–66.Crossref, Google Scholar
[20] McCallum A (1997) Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester, NY.Google Scholar
[21] McDonald C, Yüksel S (2018) Converse results on filter stability criteria and stochastic non-linear observability. Preprint, submitted December 5, https://arxiv.org/abs/1812.01772.Google Scholar
[22] McDonald C, Yüksel S (2019) Observability and filter stability for partially observed Markov processes. IEEE 58th Conf. Decision and Control (IEEE, Piscataway, NJ), 1623–1628.Google Scholar
[23] McDonald C, Yüksel S (2020) Exponential filter stability via Dobrushin’s coefficient. Electronic Comm. Probab. 25.Crossref, Google Scholar
[24] McDonald C, Yüksel S (2022) Robustness to incorrect priors and controlled filter stability in partially observed stochastic control. SIAM J. Control Optim. 60(2).Crossref, Google Scholar
[25] Parthasarathy K (1967) Probability Measures on Metric Spaces (AMS Bookstore, Providence, RI).Crossref, Google Scholar
[26] Pineau J, Gordon G, Thrun S (2006) Anytime point-based approximations for large POMDPs. J. Artificial Intelligence Res. 27(1):335–380.Crossref, Google Scholar
[27] Porta JM, Vlassis N, Spaan MTJ, Poupart P (2006) Point-based value iteration for continuous POMDPs. J. Machine Learn. Res. 7:2329–2367.Google Scholar
[28] Rhenius D (1974) Incomplete information in Markovian decision models. Ann. Statist. 2(6):1327–1334.Crossref, Google Scholar
[29] Saldi N, Yüksel S, Linder T (2017) On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. Oper. Res. 42(4):945–978.Link, Google Scholar
[30] Saldi N, Yüksel S, Linder T (2020) Finite model approximations for partially observed Markov decision processes with discounted cost. IEEE Trans. Automatic Control 65(1):130–142.Crossref, Google Scholar
[31] Singh S, Jaakkola T, Jordan M (1994) Learning without state-estimation in partially observable Markovian decision processes. Machine Learn. Proc. (Elsevier, Amsterdam), 284–292.Google Scholar
[32] Smith T, Simmons R (2012) Point-based POMDP algorithms: Improved analysis and implementation. Preprint, submitted July 4, https://arxiv.org/abs/1207.1412.Google Scholar
[33] Subramanian J, Mahajan A (2019) Approximate information state for partially observed systems. IEEE 58th Conf. Decision Control, 1629–1636.Google Scholar
[34] Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16:185–202.Crossref, Google Scholar
[35] Villani C (2008) Optimal Transport: Old and New (Springer, New York).Google Scholar
[36] Vlassis N, Spaan MTJ (2005) Perseus: Randomized point-based value iteration for POMDPs. J. Artificial Intelligence Res. 24(1):195–220.Google Scholar
[37] Wainwright M (2019) Stochastic approximation with cone-contractive operators: Sharper l∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
[38] White CC III (1991) A survey of solution techniques for the partially observed Markov decision process. Ann. Oper. Res. 32:215–230.Crossref, Google Scholar
[39] White CC III, Scherer WT (1994) Finite-memory suboptimal design for partially observed Markov decision processes. Oper. Res. 42(3):439–455.Link, Google Scholar
[40] Yu H, Bertsekas DP (2008) On near optimality of the set of finite-state controllers for average cost POMDP. Math. Oper. Res. 33(1):1–11.Link, Google Scholar
[41] Yushkevich A (1976) Reduction of a controlled Markov model with incomplete data to a problem with complete information in the case of Borel state and control spaces. Theory Probab. Appl. 21(1):153–158.Crossref, Google Scholar
[42] Zhou E, Fu MC, Marcus SI (2008) A density projection approach to dimension reduction for continuous-state POMDPs. 47th IEEE Conf. Decision Control, 5576–5581.Google Scholar
[43] Zhou E, Fu MC, Marcus SI (2010) Solving continuous-state POMDPs via density projection. IEEE Trans. Automatic Control 55(5):1101–1116.Crossref, Google Scholar
[44] Zhou R, Hansen E (2001) An improved grid-based approximation algorithm for POMDPs. Internat. J. Conf. Artificial Intelligence, 707–714.Google Scholar

cover image Mathematics of Operations Research

Volume 48, Issue 4

November 2023

Pages 1811-2382, C2

Article Information

Metrics

Information

Received:February 16, 2021
Accepted:October 11, 2022
Published Online:November 21, 2022

Cite as

Ali Devran Kara, Serdar Yüksel (2022) Convergence of Finite Memory Q Learning for POMDPs and Near Optimality of Learned Policies Under Filter Stability. Mathematics of Operations Research 48(4):2066-2093.

https://doi.org/10.1287/moor.2022.1331

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Convergence of Finite Memory Q Learning for POMDPs and Near Optimality of Learned Policies Under Filter Stability

References

Volume 48, Issue 4

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News