Convergence of Finite Memory Q Learning for POMDPs and Near Optimality of Learned Policies Under Filter Stability

Published Online:https://doi.org/10.1287/moor.2022.1331

References

  • [1] Billingsley P (1999) Convergence of Probability Measures, 2nd ed. (Wiley, New York).CrossrefGoogle Scholar
  • [2] Dobrushin R (1956) Central limit theorem for nonstationary Markov chains. I. Theory Probab. Appl. 1(1):65–80.CrossrefGoogle Scholar
  • [3] Even-Dar E, Mansour Y, Bartlett P (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(1):1–25.Google Scholar
  • [4] Feinberg E (1982) Controlled Markov processes with arbitrary numerical criteria. Theory Probab. Appl. 27(3):486–503.CrossrefGoogle Scholar
  • [5] Feinberg E, Kasyanov P, Zadioanchuk N (2012) Average cost Markov decision processes with weakly continuous transition probabilities. Math. Oper. Res. 37(4):591–607.LinkGoogle Scholar
  • [6] Feinberg E, Kasyanov P, Zgurovsky M (2016) Partially observable total-cost Markov decision process with weakly continuous transition probabilities. Math. Oper. Res. 41(2):656–681.LinkGoogle Scholar
  • [7] Golowich N, Moitra A, Rohatgi D (2022) Planning in observable POMDPs in quasipolynomial time. Preprint, submitted June 7, https://arxiv.org/abs/2201.04735.Google Scholar
  • [8] Hansen EA (2013) Solving POMDPs by searching in policy space. Preprint, submitted January 30, https://arxiv.org/abs/1301.7380.Google Scholar
  • [9] Hernández-Lerma O (1989) Adaptive Markov Control Processes (Springer-Verlag, New York).CrossrefGoogle Scholar
  • [10] Hernández-Lerma O, Lasserre JB (1996) Discrete-Time Markov Control Processes: Basic Optimality Criteria (Springer, New York).CrossrefGoogle Scholar
  • [11] Jaakkola T, Jordan M, Singh S (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6(6):1185–1201.CrossrefGoogle Scholar
  • [12] Jaakkola T, Singh S, Jordan M (1995) Reinforcement learning algorithm for partially observable Markov decision problems. Adv. Neural Inform. Processing Systems 7:345–352.Google Scholar
  • [13] Kac M (1947) On the notion of recurrence in discrete stochastic processes. Bull. Amer. Math. Soc. 53(10):1002–1010.CrossrefGoogle Scholar
  • [14] Kara A, Yuksel S (2022) Near optimality of finite memory feedback policies in partially observed Markov decision processes. J. Machine Learn. Res. 23(1):1–46.Google Scholar
  • [15] Kara A, Saldi N, Yüksel S (2019) Weak Feller property of non-linear filters. Systems Control Lett. 134:104512.CrossrefGoogle Scholar
  • [16] Krishnamurthy V (2016) Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [17] Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Preprint, submitted June 4, https://arxiv.org/abs/2006.03041.Google Scholar
  • [18] Lin LJ, Mitchell TM (1992) Memory approaches to reinforcement learning in non-Markovian domains. Report, Carnegie Mellon University, Pittsburgh.Google Scholar
  • [19] Lovejoy W (1991) A survey of algorithmic methods for partially observed Markov decision processes. Ann. Oper. Res. 28:47–66.CrossrefGoogle Scholar
  • [20] McCallum A (1997) Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester, NY.Google Scholar
  • [21] McDonald C, Yüksel S (2018) Converse results on filter stability criteria and stochastic non-linear observability. Preprint, submitted December 5, https://arxiv.org/abs/1812.01772.Google Scholar
  • [22] McDonald C, Yüksel S (2019) Observability and filter stability for partially observed Markov processes. IEEE 58th Conf. Decision and Control (IEEE, Piscataway, NJ), 1623–1628.Google Scholar
  • [23] McDonald C, Yüksel S (2020) Exponential filter stability via Dobrushin’s coefficient. Electronic Comm. Probab. 25.CrossrefGoogle Scholar
  • [24] McDonald C, Yüksel S (2022) Robustness to incorrect priors and controlled filter stability in partially observed stochastic control. SIAM J. Control Optim. 60(2).CrossrefGoogle Scholar
  • [25] Parthasarathy K (1967) Probability Measures on Metric Spaces (AMS Bookstore, Providence, RI).CrossrefGoogle Scholar
  • [26] Pineau J, Gordon G, Thrun S (2006) Anytime point-based approximations for large POMDPs. J. Artificial Intelligence Res. 27(1):335–380.CrossrefGoogle Scholar
  • [27] Porta JM, Vlassis N, Spaan MTJ, Poupart P (2006) Point-based value iteration for continuous POMDPs. J. Machine Learn. Res. 7:2329–2367.Google Scholar
  • [28] Rhenius D (1974) Incomplete information in Markovian decision models. Ann. Statist. 2(6):1327–1334.CrossrefGoogle Scholar
  • [29] Saldi N, Yüksel S, Linder T (2017) On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. Oper. Res. 42(4):945–978.LinkGoogle Scholar
  • [30] Saldi N, Yüksel S, Linder T (2020) Finite model approximations for partially observed Markov decision processes with discounted cost. IEEE Trans. Automatic Control 65(1):130–142.CrossrefGoogle Scholar
  • [31] Singh S, Jaakkola T, Jordan M (1994) Learning without state-estimation in partially observable Markovian decision processes. Machine Learn. Proc. (Elsevier, Amsterdam), 284–292.Google Scholar
  • [32] Smith T, Simmons R (2012) Point-based POMDP algorithms: Improved analysis and implementation. Preprint, submitted July 4, https://arxiv.org/abs/1207.1412.Google Scholar
  • [33] Subramanian J, Mahajan A (2019) Approximate information state for partially observed systems. IEEE 58th Conf. Decision Control, 1629–1636.Google Scholar
  • [34] Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16:185–202.CrossrefGoogle Scholar
  • [35] Villani C (2008) Optimal Transport: Old and New (Springer, New York).Google Scholar
  • [36] Vlassis N, Spaan MTJ (2005) Perseus: Randomized point-based value iteration for POMDPs. J. Artificial Intelligence Res. 24(1):195–220.Google Scholar
  • [37] Wainwright M (2019) Stochastic approximation with cone-contractive operators: Sharper l∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
  • [38] White CC III (1991) A survey of solution techniques for the partially observed Markov decision process. Ann. Oper. Res. 32:215–230.CrossrefGoogle Scholar
  • [39] White CC III, Scherer WT (1994) Finite-memory suboptimal design for partially observed Markov decision processes. Oper. Res. 42(3):439–455.LinkGoogle Scholar
  • [40] Yu H, Bertsekas DP (2008) On near optimality of the set of finite-state controllers for average cost POMDP. Math. Oper. Res. 33(1):1–11.LinkGoogle Scholar
  • [41] Yushkevich A (1976) Reduction of a controlled Markov model with incomplete data to a problem with complete information in the case of Borel state and control spaces. Theory Probab. Appl. 21(1):153–158.CrossrefGoogle Scholar
  • [42] Zhou E, Fu MC, Marcus SI (2008) A density projection approach to dimension reduction for continuous-state POMDPs. 47th IEEE Conf. Decision Control, 5576–5581.Google Scholar
  • [43] Zhou E, Fu MC, Marcus SI (2010) Solving continuous-state POMDPs via density projection. IEEE Trans. Automatic Control 55(5):1101–1116.CrossrefGoogle Scholar
  • [44] Zhou R, Hansen E (2001) An improved grid-based approximation algorithm for POMDPs. Internat. J. Conf. Artificial Intelligence, 707–714.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.