Off-line Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity

Published Online:https://doi.org/10.1287/opre.2023.0046

References

  • Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Thirty Third Conf. Learn. Theory (PMLR, New York), 67–83.Google Scholar
  • Armony M, Israelit S, Mandelbaum A, Marmor YN, Tseytlin Y, Yom-Tov GB (2015) On patient flow in hospitals: A data-based queueing-science perspective. Stochastic Systems 5(1):146–194.LinkGoogle Scholar
  • Banerjee I, Rao VA, Honnappa H (2021) PAC-Bayes bounds on variational tempered posteriors for Markov models. Entropy 23(3):313.CrossrefGoogle Scholar
  • Bertsekas DP (2011) Dynamic Programming and Optimal Control, 3rd ed., vol. II (Athena Scientific, Belmont, MA).Google Scholar
  • Billingsley P (1961) Statistical methods in Markov chains. Ann. Math. Statist. 32(1):12–40.CrossrefGoogle Scholar
  • Borkar VS (1991) Topics in Controlled Markov Chains (Longman, Harlow, UK).Google Scholar
  • Bradley RC (2005) Basic properties of strong mixing conditions: A survey and some open questions. Probab. Surveys 2:107–144.CrossrefGoogle Scholar
  • Brandwein AC, Strawderman WE (1980) Minimax estimation of location parameters for spherically symmetric distributions with concave loss. Ann. Statist. 8(2):279–284.CrossrefGoogle Scholar
  • Chen IY, Joshi S, Ghassemi M, Ranganath R (2021) Probabilistic machine learning for healthcare. Annual Rev. Biomedical Data Sci. 4:393–415.CrossrefGoogle Scholar
  • Chin R, Maass AI, Ulapane N, Manzie C, Shames I, Nešić D, Rowe JE, Nakada H (2020) Active learning for linear parameter-varying system identification. IFAC-PapersOnLine 53(2):989–994.CrossrefGoogle Scholar
  • Dobrushin RL (1956a) Central limit theorem for nonstationary Markov chains. I. Theory Probab. Appl. 1(1):65–80.CrossrefGoogle Scholar
  • Dobrushin RL (1956b) Central limit theorem for nonstationary Markov chains. II. Theory Probab. Appl. 1(4):329–383.CrossrefGoogle Scholar
  • Fourdrinier D, Mezoued F, Strawderman WE (2013) Bayes minimax estimation under power priors of location parameters for a wide class of spherically symmetric distributions. Electronic J. Statist. 7:717–741.CrossrefGoogle Scholar
  • Geyer CJ (1998) Markov chain Monte Carlo lecture notes. Course Notes, Spring Quarter 80, Minneapolis.Google Scholar
  • Goldberg DA, Reiman MI, Wang Q (2021) A survey of recent progress in the asymptotic analysis of inventory systems. Production Oper. Management 30(6):1718–1750.Google Scholar
  • Gut A, Gut A (2005) Probability: A Graduate Course, vol. 5 (Springer, New York).Google Scholar
  • Hernández-Lerma O, Montes-de Oca R, Cavazos-Cadena R (1991) Recurrence conditions for Markov decision processes with Borel state space: A survey. Ann. Oper. Res. 28(1):29–46.CrossrefGoogle Scholar
  • Icarte RT, Klassen T, Valenzano R, McIlraith S (2018) Using reward machines for high-level task specification and decomposition in reinforcement learning. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 2107–2116.Google Scholar
  • Jones GL (2004) On the Markov chain central limit theorem. Probab. Surveys 1:299–320.CrossrefGoogle Scholar
  • Khamaru K, Xia E, Wainwright MJ, Jordan MI (2022) Instance-dependent confidence and early stopping for reinforcement learning. Preprint, submitted January 1, https://arxiv.org/abs/2201.08536.Google Scholar
  • Kidambi R, Rajeswaran A, Netrapalli P, Joachims T (2020) Morel: Model-based offline reinforcement learning. Adv. Neural Inform. Processing Systems 33:21810–21823.Google Scholar
  • Kontorovich LA, Ramanan K (2008) Concentration inequalities for dependent random variables via the martingale method. Ann. Probab. 36(6):2126–2158.CrossrefGoogle Scholar
  • Laroche R, Tachet Des Combes R (2023) On the occupancy measure of non-Markovian policies in continuous MDPs. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato, S Scarlett J, eds. Proc. 40th Internat. Conf. Machine Learn., vol. 202 (PMLR, New York), 18548–18562.Google Scholar
  • Laroche R, Combes RTD, Buckman J (2022) Non-Markovian policies occupancy measures. Preprint, submitted May 27, https://arxiv.org/abs/2205.13950.Google Scholar
  • Lee HR, Lee T (2018) Markov decision process model for patient admission decision at an emergency department under a surge demand. Flexible Services Manufacturing J. 30(1):98–122.Google Scholar
  • Lehmann EL, Casella G (1998) Theory of Point Estimation, 2nd ed. (Springer, New York), 150.Google Scholar
  • Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.Google Scholar
  • Li Y, Wang R, Yang LF (2022b) Settling the horizon-dependence of sample complexity in reinforcement learning. 2021 IEEE 62nd Annual Sympos. Foundations Computer Sci. (IEEE, Piscataway, NJ), 965–976.Google Scholar
  • Li G, Shi L, Chen Y, Chi Y, Wei Y (2022a) Settling the sample complexity of model-based offline reinforcement learning. Preprint, submitted April 11, https://arxiv.org/abs/2204.05275.Google Scholar
  • Liu S, See KC, Ngiam KY, Celi LA, Sun X, Feng M (2020) Reinforcement learning for clinical decision support in critical care: Comprehensive review. J. Medical Internet Res. 22(7):e18477.CrossrefGoogle Scholar
  • Ljung L (1987) System Identification: Theory for the User (Prentice-Hall, Englewood Cliffs, NJ).Google Scholar
  • Ljung L (2010) Perspectives on system identification. Annual Rev. Control 34(1):1–12.CrossrefGoogle Scholar
  • Löffler M, Picard A (2021) Spectral thresholding for the estimation of Markov chain transition operators. Electronic J. Statist. 15(2):6281–6310.CrossrefGoogle Scholar
  • Mania H, Jordan MI, Recht B (2020) Active learning for nonlinear system identification with guarantees. Preprint, submitted June 18, https://arxiv.org/abs/2006.10277.Google Scholar
  • Mannor S, Tsitsiklis JN (2005) On the empirical state-action frequencies in Markov decision processes under general policies. Math. Oper. Res. 30(3):545–561.LinkGoogle Scholar
  • Meyn SP, Tweedie RL (2012) Markov Chains and Stochastic Stability (Springer Science & Business Media, London).Google Scholar
  • Mou W, Pananjady A, Wainwright MJ, Bartlett PL (2021) Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. Preprint, submitted December 23, https://arxiv.org/abs/2112.12770.Google Scholar
  • Mukhamedov F (2013) The Dobrushin ergodicity coefficient and the ergodicity of noncommutative Markov chains. J. Math. Anal. Appl. 408(1):364–373.CrossrefGoogle Scholar
  • Mutti M, De Santi R, Restelli M (2022) The importance of non-Markovianity in maximum state entropy exploration. Preprint, submitted February 7, https://arxiv.org/abs/2202.03060.Google Scholar
  • Pinto L, Davidson J, Sukthankar R, Gupta A (2017) Robust adversarial reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 2817–2826.Google Scholar
  • RayChaudhuri T, Hamey LG (1996) Active learning for nonlinear system identification and control. IFAC Proc. Volumes 29(1):2592–2596.CrossrefGoogle Scholar
  • Rosenblatt-Roth M (1963) Some theorems concerning the law of large numbers for non-homogeneous Markoff chains. Zeitschrift Wahrscheinlichkeitstheorie Verwandte Gebiete 1(5):433–445.CrossrefGoogle Scholar
  • Rosenblatt-Roth M (1964) Some theorems concerning the strong law of large numbers for non-homogeneous Markov chains. Ann. Math. Statist. 35(2):566–576.CrossrefGoogle Scholar
  • Rosenthal JS (1995) Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 90(430):558–566.CrossrefGoogle Scholar
  • Sart M (2014) Estimation of the transition density of a Markov chain. Annales l’IHP Probabilités Statistiques 50(3):1028–1068.Google Scholar
  • Shortreed SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA (2011) Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learn. 84(1):109–136.CrossrefGoogle Scholar
  • Showkatbakhsh M, Tabuada P, Diggavi S (2016) System identification in the presence of adversarial outputs. 2016 IEEE 55th Conf. Decision Control (IEEE, Piscataway, NJ), 7177–7182.Google Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Tangirala AK (2018) Principles of System Identification: Theory and Practice (CRC Press, Boca Raton, FL).CrossrefGoogle Scholar
  • Tsybakov AB (2009) Introduction to Nonparametric Estimation (Springer, New York).CrossrefGoogle Scholar
  • van Eeden C (2006) Minimax estimators and their admissibility. Restricted Parameter Space Estimation Problems: Admissibility and Minimaxity Properties (Springer, New York), 33–67.CrossrefGoogle Scholar
  • Vidyasagar M, Karandikar RL (2006) A learning theory approach to system identification and stochastic adaptive control. Probabilistic and Randomized Methods for Design Under Uncertainty (Springer, London), 265–302.CrossrefGoogle Scholar
  • Wang S, Si N, Blanchet J, Zhou Z (2023) On the foundation of distributionally robust reinforcement learning. Preprint, submitted November 15, https://arxiv.org/abs/2311.09018.Google Scholar
  • Wolfer G, Kontorovich A (2021) Statistical estimation of ergodic Markov chain kernel over discrete state space. Bernoulli 27(1):532–553.CrossrefGoogle Scholar
  • Yakowitz S (1979) Nonparametric estimation of Markov transition functions. Ann. Statist. 7(3):671–679.CrossrefGoogle Scholar
  • Yakowitz S (1989) Nonparametric density and regression estimation for Markov sequences without mixing assumptions. J. Multivariate Anal. 30(1):124–136.CrossrefGoogle Scholar
  • Yan Y, Li G, Chen Y, Fan J (2022) Model-based reinforcement learning is minimax-optimal for offline zero-sum Markov games. Preprint, submitted June 8, https://arxiv.org/abs/2206.04044.Google Scholar
  • Yu C, Liu J, Nemati S, Yin G (2021) Reinforcement learning in healthcare: A survey. ACM Comput. Surveys 55(1):1–36.CrossrefGoogle Scholar
  • Yu T, Thomas G, Yu L, Ermon S, Zou JY, Levine S, Finn C, Ma T (2020) MOPO: Model-based offline policy optimization. Adv. Neural Inform. Processing Systems 33:14129–14142.Google Scholar
  • Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Internat. Conf. Machine Learn. (PMLR, New York), 7304–7312.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.