Off-line Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity
References
- Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Thirty Third Conf. Learn. Theory (PMLR, New York), 67–83.Google Scholar
- (2015) On patient flow in hospitals: A data-based queueing-science perspective. Stochastic Systems 5(1):146–194.Link, Google Scholar
- (2021) PAC-Bayes bounds on variational tempered posteriors for Markov models. Entropy 23(3):313.Crossref, Google Scholar
- (2011) Dynamic Programming and Optimal Control, 3rd ed., vol. II (Athena Scientific, Belmont, MA).Google Scholar
- (1961) Statistical methods in Markov chains. Ann. Math. Statist. 32(1):12–40.Crossref, Google Scholar
- (1991) Topics in Controlled Markov Chains (Longman, Harlow, UK).Google Scholar
- (2005) Basic properties of strong mixing conditions: A survey and some open questions. Probab. Surveys 2:107–144.Crossref, Google Scholar
- (1980) Minimax estimation of location parameters for spherically symmetric distributions with concave loss. Ann. Statist. 8(2):279–284.Crossref, Google Scholar
- (2021) Probabilistic machine learning for healthcare. Annual Rev. Biomedical Data Sci. 4:393–415.Crossref, Google Scholar
- (2020) Active learning for linear parameter-varying system identification. IFAC-PapersOnLine 53(2):989–994.Crossref, Google Scholar
- (1956a) Central limit theorem for nonstationary Markov chains. I. Theory Probab. Appl. 1(1):65–80.Crossref, Google Scholar
- (1956b) Central limit theorem for nonstationary Markov chains. II. Theory Probab. Appl. 1(4):329–383.Crossref, Google Scholar
- (2013) Bayes minimax estimation under power priors of location parameters for a wide class of spherically symmetric distributions. Electronic J. Statist. 7:717–741.Crossref, Google Scholar
- (1998) Markov chain Monte Carlo lecture notes. Course Notes, Spring Quarter 80, Minneapolis.Google Scholar
- Goldberg DA, Reiman MI, Wang Q (2021) A survey of recent progress in the asymptotic analysis of inventory systems. Production Oper. Management 30(6):1718–1750.Google Scholar
- (2005) Probability: A Graduate Course, vol. 5 (Springer, New York).Google Scholar
- (1991) Recurrence conditions for Markov decision processes with Borel state space: A survey. Ann. Oper. Res. 28(1):29–46.Crossref, Google Scholar
- Icarte RT, Klassen T, Valenzano R, McIlraith S (2018) Using reward machines for high-level task specification and decomposition in reinforcement learning. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 2107–2116.Google Scholar
- (2004) On the Markov chain central limit theorem. Probab. Surveys 1:299–320.Crossref, Google Scholar
- (2022) Instance-dependent confidence and early stopping for reinforcement learning. Preprint, submitted January 1, https://arxiv.org/abs/2201.08536.Google Scholar
- (2020) Morel: Model-based offline reinforcement learning. Adv. Neural Inform. Processing Systems 33:21810–21823.Google Scholar
- (2008) Concentration inequalities for dependent random variables via the martingale method. Ann. Probab. 36(6):2126–2158.Crossref, Google Scholar
- (2023) On the occupancy measure of non-Markovian policies in continuous MDPs. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato, S Scarlett J, eds. Proc. 40th Internat. Conf. Machine Learn., vol. 202 (PMLR, New York), 18548–18562.Google Scholar
- (2022) Non-Markovian policies occupancy measures. Preprint, submitted May 27, https://arxiv.org/abs/2205.13950.Google Scholar
- Lee HR, Lee T (2018) Markov decision process model for patient admission decision at an emergency department under a surge demand. Flexible Services Manufacturing J. 30(1):98–122.Google Scholar
- Lehmann EL, Casella G (1998) Theory of Point Estimation, 2nd ed. (Springer, New York), 150.Google Scholar
- (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.Google Scholar
- (2022b) Settling the horizon-dependence of sample complexity in reinforcement learning. 2021 IEEE 62nd Annual Sympos. Foundations Computer Sci. (IEEE, Piscataway, NJ), 965–976.Google Scholar
- (2022a) Settling the sample complexity of model-based offline reinforcement learning. Preprint, submitted April 11, https://arxiv.org/abs/2204.05275.Google Scholar
- (2020) Reinforcement learning for clinical decision support in critical care: Comprehensive review. J. Medical Internet Res. 22(7):e18477.Crossref, Google Scholar
- Ljung L (1987) System Identification: Theory for the User (Prentice-Hall, Englewood Cliffs, NJ).Google Scholar
- (2010) Perspectives on system identification. Annual Rev. Control 34(1):1–12.Crossref, Google Scholar
- (2021) Spectral thresholding for the estimation of Markov chain transition operators. Electronic J. Statist. 15(2):6281–6310.Crossref, Google Scholar
- (2020) Active learning for nonlinear system identification with guarantees. Preprint, submitted June 18, https://arxiv.org/abs/2006.10277.Google Scholar
- (2005) On the empirical state-action frequencies in Markov decision processes under general policies. Math. Oper. Res. 30(3):545–561.Link, Google Scholar
- (2012) Markov Chains and Stochastic Stability (Springer Science & Business Media, London).Google Scholar
- (2021) Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. Preprint, submitted December 23, https://arxiv.org/abs/2112.12770.Google Scholar
- (2013) The Dobrushin ergodicity coefficient and the ergodicity of noncommutative Markov chains. J. Math. Anal. Appl. 408(1):364–373.Crossref, Google Scholar
- (2022) The importance of non-Markovianity in maximum state entropy exploration. Preprint, submitted February 7, https://arxiv.org/abs/2202.03060.Google Scholar
- (2017) Robust adversarial reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 2817–2826.Google Scholar
- (1996) Active learning for nonlinear system identification and control. IFAC Proc. Volumes 29(1):2592–2596.Crossref, Google Scholar
- (1963) Some theorems concerning the law of large numbers for non-homogeneous Markoff chains. Zeitschrift Wahrscheinlichkeitstheorie Verwandte Gebiete 1(5):433–445.Crossref, Google Scholar
- (1964) Some theorems concerning the strong law of large numbers for non-homogeneous Markov chains. Ann. Math. Statist. 35(2):566–576.Crossref, Google Scholar
- (1995) Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 90(430):558–566.Crossref, Google Scholar
- (2014) Estimation of the transition density of a Markov chain. Annales l’IHP Probabilités Statistiques 50(3):1028–1068.Google Scholar
- (2011) Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learn. 84(1):109–136.Crossref, Google Scholar
- (2016) System identification in the presence of adversarial outputs. 2016 IEEE 55th Conf. Decision Control (IEEE, Piscataway, NJ), 7177–7182.Google Scholar
- Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (2018) Principles of System Identification: Theory and Practice (CRC Press, Boca Raton, FL).Crossref, Google Scholar
- (2009) Introduction to Nonparametric Estimation (Springer, New York).Crossref, Google Scholar
- (2006) Minimax estimators and their admissibility. Restricted Parameter Space Estimation Problems: Admissibility and Minimaxity Properties (Springer, New York), 33–67.Crossref, Google Scholar
- (2006) A learning theory approach to system identification and stochastic adaptive control. Probabilistic and Randomized Methods for Design Under Uncertainty (Springer, London), 265–302.Crossref, Google Scholar
- Wang S, Si N, Blanchet J, Zhou Z (2023) On the foundation of distributionally robust reinforcement learning. Preprint, submitted November 15, https://arxiv.org/abs/2311.09018.Google Scholar
- (2021) Statistical estimation of ergodic Markov chain kernel over discrete state space. Bernoulli 27(1):532–553.Crossref, Google Scholar
- (1979) Nonparametric estimation of Markov transition functions. Ann. Statist. 7(3):671–679.Crossref, Google Scholar
- (1989) Nonparametric density and regression estimation for Markov sequences without mixing assumptions. J. Multivariate Anal. 30(1):124–136.Crossref, Google Scholar
- (2022) Model-based reinforcement learning is minimax-optimal for offline zero-sum Markov games. Preprint, submitted June 8, https://arxiv.org/abs/2206.04044.Google Scholar
- (2021) Reinforcement learning in healthcare: A survey. ACM Comput. Surveys 55(1):1–36.Crossref, Google Scholar
- (2020) MOPO: Model-based offline policy optimization. Adv. Neural Inform. Processing Systems 33:14129–14142.Google Scholar
- (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Internat. Conf. Machine Learn. (PMLR, New York), 7304–7312.Google Scholar

