Convergence of Finite Memory Q Learning for POMDPs and Near Optimality of Learned Policies Under Filter Stability
Published Online:21 Nov 2022https://doi.org/10.1287/moor.2022.1331
References
- [1] (1999) Convergence of Probability Measures, 2nd ed. (Wiley, New York).Crossref, Google Scholar
- [2] (1956) Central limit theorem for nonstationary Markov chains. I. Theory Probab. Appl. 1(1):65–80.Crossref, Google Scholar
- [3] (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(1):1–25.Google Scholar
- [4] (1982) Controlled Markov processes with arbitrary numerical criteria. Theory Probab. Appl. 27(3):486–503.Crossref, Google Scholar
- [5] (2012) Average cost Markov decision processes with weakly continuous transition probabilities. Math. Oper. Res. 37(4):591–607.Link, Google Scholar
- [6] (2016) Partially observable total-cost Markov decision process with weakly continuous transition probabilities. Math. Oper. Res. 41(2):656–681.Link, Google Scholar
- [7] (2022) Planning in observable POMDPs in quasipolynomial time. Preprint, submitted June 7, https://arxiv.org/abs/2201.04735.Google Scholar
- [8] (2013) Solving POMDPs by searching in policy space. Preprint, submitted January 30, https://arxiv.org/abs/1301.7380.Google Scholar
- [9] (1989) Adaptive Markov Control Processes (Springer-Verlag, New York).Crossref, Google Scholar
- [10] (1996) Discrete-Time Markov Control Processes: Basic Optimality Criteria (Springer, New York).Crossref, Google Scholar
- [11] (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6(6):1185–1201.Crossref, Google Scholar
- [12] (1995) Reinforcement learning algorithm for partially observable Markov decision problems. Adv. Neural Inform. Processing Systems 7:345–352.Google Scholar
- [13] (1947) On the notion of recurrence in discrete stochastic processes. Bull. Amer. Math. Soc. 53(10):1002–1010.Crossref, Google Scholar
- [14] (2022) Near optimality of finite memory feedback policies in partially observed Markov decision processes. J. Machine Learn. Res. 23(1):1–46.Google Scholar
- [15] (2019) Weak Feller property of non-linear filters. Systems Control Lett. 134:104512.Crossref, Google Scholar
- [16] (2016) Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [17] (2020) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Preprint, submitted June 4, https://arxiv.org/abs/2006.03041.Google Scholar
- [18] (1992) Memory approaches to reinforcement learning in non-Markovian domains. Report, Carnegie Mellon University, Pittsburgh.Google Scholar
- [19] (1991) A survey of algorithmic methods for partially observed Markov decision processes. Ann. Oper. Res. 28:47–66.Crossref, Google Scholar
- [20] (1997) Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester, NY.Google Scholar
- [21] (2018) Converse results on filter stability criteria and stochastic non-linear observability. Preprint, submitted December 5, https://arxiv.org/abs/1812.01772.Google Scholar
- [22] (2019) Observability and filter stability for partially observed Markov processes. IEEE 58th Conf. Decision and Control (IEEE, Piscataway, NJ), 1623–1628.Google Scholar
- [23] (2020) Exponential filter stability via Dobrushin’s coefficient. Electronic Comm. Probab. 25.Crossref, Google Scholar
- [24] (2022) Robustness to incorrect priors and controlled filter stability in partially observed stochastic control. SIAM J. Control Optim. 60(2).Crossref, Google Scholar
- [25] (1967) Probability Measures on Metric Spaces (AMS Bookstore, Providence, RI).Crossref, Google Scholar
- [26] (2006) Anytime point-based approximations for large POMDPs. J. Artificial Intelligence Res. 27(1):335–380.Crossref, Google Scholar
- [27] (2006) Point-based value iteration for continuous POMDPs. J. Machine Learn. Res. 7:2329–2367.Google Scholar
- [28] (1974) Incomplete information in Markovian decision models. Ann. Statist. 2(6):1327–1334.Crossref, Google Scholar
- [29] (2017) On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. Oper. Res. 42(4):945–978.Link, Google Scholar
- [30] (2020) Finite model approximations for partially observed Markov decision processes with discounted cost. IEEE Trans. Automatic Control 65(1):130–142.Crossref, Google Scholar
- [31] (1994) Learning without state-estimation in partially observable Markovian decision processes. Machine Learn. Proc. (Elsevier, Amsterdam), 284–292.Google Scholar
- [32] (2012) Point-based POMDP algorithms: Improved analysis and implementation. Preprint, submitted July 4, https://arxiv.org/abs/1207.1412.Google Scholar
- [33] (2019) Approximate information state for partially observed systems. IEEE 58th Conf. Decision Control, 1629–1636.Google Scholar
- [34] (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16:185–202.Crossref, Google Scholar
- [35] (2008) Optimal Transport: Old and New (Springer, New York).Google Scholar
- [36] (2005) Perseus: Randomized point-based value iteration for POMDPs. J. Artificial Intelligence Res. 24(1):195–220.Google Scholar
- [37] (2019) Stochastic approximation with cone-contractive operators: Sharper l∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
- [38] (1991) A survey of solution techniques for the partially observed Markov decision process. Ann. Oper. Res. 32:215–230.Crossref, Google Scholar
- [39] (1994) Finite-memory suboptimal design for partially observed Markov decision processes. Oper. Res. 42(3):439–455.Link, Google Scholar
- [40] (2008) On near optimality of the set of finite-state controllers for average cost POMDP. Math. Oper. Res. 33(1):1–11.Link, Google Scholar
- [41] (1976) Reduction of a controlled Markov model with incomplete data to a problem with complete information in the case of Borel state and control spaces. Theory Probab. Appl. 21(1):153–158.Crossref, Google Scholar
- [42] (2008) A density projection approach to dimension reduction for continuous-state POMDPs. 47th IEEE Conf. Decision Control, 5576–5581.Google Scholar
- [43] (2010) Solving continuous-state POMDPs via density projection. IEEE Trans. Automatic Control 55(5):1101–1116.Crossref, Google Scholar
- [44] (2001) An improved grid-based approximation algorithm for POMDPs. Internat. J. Conf. Artificial Intelligence, 707–714.Google Scholar

