Off-line Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity

Imon Banerjee
Corresponding Author
Imon Banerjee
[email protected]
https://orcid.org/0000-0003-2572-3048
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208
Search for more papers by this author
,
Harsha Honnappa
Harsha Honnappa
[email protected]
https://orcid.org/0000-0002-0834-054X
Edwardson School of Industrial Engineering, Purdue University, West Lafayette, Indiana 47907
Search for more papers by this author
,
Vinayak Rao
Vinayak Rao
[email protected]
https://orcid.org/0000-0002-6249-2923
Department of Statistics, Purdue University, West Lafayette, Indiana 47907
Search for more papers by this author

Imon Banerjee

Corresponding Author

Imon Banerjee

[email protected]

https://orcid.org/0000-0003-2572-3048

Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208

Search for more papers by this author

Harsha Honnappa

[email protected]

https://orcid.org/0000-0002-0834-054X

Edwardson School of Industrial Engineering, Purdue University, West Lafayette, Indiana 47907

Search for more papers by this author

Vinayak Rao

[email protected]

https://orcid.org/0000-0002-6249-2923

Department of Statistics, Purdue University, West Lafayette, Indiana 47907

Search for more papers by this author

Published Online:21 Feb 2025https://doi.org/10.1287/opre.2023.0046

References

Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Thirty Third Conf. Learn. Theory (PMLR, New York), 67–83.Google Scholar
Armony M, Israelit S, Mandelbaum A, Marmor YN, Tseytlin Y, Yom-Tov GB (2015) On patient flow in hospitals: A data-based queueing-science perspective. Stochastic Systems 5(1):146–194.Link, Google Scholar
Banerjee I, Rao VA, Honnappa H (2021) PAC-Bayes bounds on variational tempered posteriors for Markov models. Entropy 23(3):313.Crossref, Google Scholar
Bertsekas DP (2011) Dynamic Programming and Optimal Control, 3rd ed., vol. II (Athena Scientific, Belmont, MA).Google Scholar
Billingsley P (1961) Statistical methods in Markov chains. Ann. Math. Statist. 32(1):12–40.Crossref, Google Scholar
Borkar VS (1991) Topics in Controlled Markov Chains (Longman, Harlow, UK).Google Scholar
Bradley RC (2005) Basic properties of strong mixing conditions: A survey and some open questions. Probab. Surveys 2:107–144.Crossref, Google Scholar
Brandwein AC, Strawderman WE (1980) Minimax estimation of location parameters for spherically symmetric distributions with concave loss. Ann. Statist. 8(2):279–284.Crossref, Google Scholar
Chen IY, Joshi S, Ghassemi M, Ranganath R (2021) Probabilistic machine learning for healthcare. Annual Rev. Biomedical Data Sci. 4:393–415.Crossref, Google Scholar
Chin R, Maass AI, Ulapane N, Manzie C, Shames I, Nešić D, Rowe JE, Nakada H (2020) Active learning for linear parameter-varying system identification. IFAC-PapersOnLine 53(2):989–994.Crossref, Google Scholar
Dobrushin RL (1956a) Central limit theorem for nonstationary Markov chains. I. Theory Probab. Appl. 1(1):65–80.Crossref, Google Scholar
Dobrushin RL (1956b) Central limit theorem for nonstationary Markov chains. II. Theory Probab. Appl. 1(4):329–383.Crossref, Google Scholar
Fourdrinier D, Mezoued F, Strawderman WE (2013) Bayes minimax estimation under power priors of location parameters for a wide class of spherically symmetric distributions. Electronic J. Statist. 7:717–741.Crossref, Google Scholar
Geyer CJ (1998) Markov chain Monte Carlo lecture notes. Course Notes, Spring Quarter 80, Minneapolis.Google Scholar
Goldberg DA, Reiman MI, Wang Q (2021) A survey of recent progress in the asymptotic analysis of inventory systems. Production Oper. Management 30(6):1718–1750.Google Scholar
Gut A, Gut A (2005) Probability: A Graduate Course, vol. 5 (Springer, New York).Google Scholar
Hernández-Lerma O, Montes-de Oca R, Cavazos-Cadena R (1991) Recurrence conditions for Markov decision processes with Borel state space: A survey. Ann. Oper. Res. 28(1):29–46.Crossref, Google Scholar
Icarte RT, Klassen T, Valenzano R, McIlraith S (2018) Using reward machines for high-level task specification and decomposition in reinforcement learning. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 2107–2116.Google Scholar
Jones GL (2004) On the Markov chain central limit theorem. Probab. Surveys 1:299–320.Crossref, Google Scholar
Khamaru K, Xia E, Wainwright MJ, Jordan MI (2022) Instance-dependent confidence and early stopping for reinforcement learning. Preprint, submitted January 1, https://arxiv.org/abs/2201.08536.Google Scholar
Kidambi R, Rajeswaran A, Netrapalli P, Joachims T (2020) Morel: Model-based offline reinforcement learning. Adv. Neural Inform. Processing Systems 33:21810–21823.Google Scholar
Kontorovich LA, Ramanan K (2008) Concentration inequalities for dependent random variables via the martingale method. Ann. Probab. 36(6):2126–2158.Crossref, Google Scholar
Laroche R, Tachet Des Combes R (2023) On the occupancy measure of non-Markovian policies in continuous MDPs. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato, S Scarlett J, eds. Proc. 40th Internat. Conf. Machine Learn., vol. 202 (PMLR, New York), 18548–18562.Google Scholar
Laroche R, Combes RTD, Buckman J (2022) Non-Markovian policies occupancy measures. Preprint, submitted May 27, https://arxiv.org/abs/2205.13950.Google Scholar
Lee HR, Lee T (2018) Markov decision process model for patient admission decision at an emergency department under a surge demand. Flexible Services Manufacturing J. 30(1):98–122.Google Scholar
Lehmann EL, Casella G (1998) Theory of Point Estimation, 2nd ed. (Springer, New York), 150.Google Scholar
Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.Google Scholar
Li Y, Wang R, Yang LF (2022b) Settling the horizon-dependence of sample complexity in reinforcement learning. 2021 IEEE 62nd Annual Sympos. Foundations Computer Sci. (IEEE, Piscataway, NJ), 965–976.Google Scholar
Li G, Shi L, Chen Y, Chi Y, Wei Y (2022a) Settling the sample complexity of model-based offline reinforcement learning. Preprint, submitted April 11, https://arxiv.org/abs/2204.05275.Google Scholar
Liu S, See KC, Ngiam KY, Celi LA, Sun X, Feng M (2020) Reinforcement learning for clinical decision support in critical care: Comprehensive review. J. Medical Internet Res. 22(7):e18477.Crossref, Google Scholar
Ljung L (1987) System Identification: Theory for the User (Prentice-Hall, Englewood Cliffs, NJ).Google Scholar
Ljung L (2010) Perspectives on system identification. Annual Rev. Control 34(1):1–12.Crossref, Google Scholar
Löffler M, Picard A (2021) Spectral thresholding for the estimation of Markov chain transition operators. Electronic J. Statist. 15(2):6281–6310.Crossref, Google Scholar
Mania H, Jordan MI, Recht B (2020) Active learning for nonlinear system identification with guarantees. Preprint, submitted June 18, https://arxiv.org/abs/2006.10277.Google Scholar
Mannor S, Tsitsiklis JN (2005) On the empirical state-action frequencies in Markov decision processes under general policies. Math. Oper. Res. 30(3):545–561.Link, Google Scholar
Meyn SP, Tweedie RL (2012) Markov Chains and Stochastic Stability (Springer Science & Business Media, London).Google Scholar
Mou W, Pananjady A, Wainwright MJ, Bartlett PL (2021) Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. Preprint, submitted December 23, https://arxiv.org/abs/2112.12770.Google Scholar
Mukhamedov F (2013) The Dobrushin ergodicity coefficient and the ergodicity of noncommutative Markov chains. J. Math. Anal. Appl. 408(1):364–373.Crossref, Google Scholar
Mutti M, De Santi R, Restelli M (2022) The importance of non-Markovianity in maximum state entropy exploration. Preprint, submitted February 7, https://arxiv.org/abs/2202.03060.Google Scholar
Pinto L, Davidson J, Sukthankar R, Gupta A (2017) Robust adversarial reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 2817–2826.Google Scholar
RayChaudhuri T, Hamey LG (1996) Active learning for nonlinear system identification and control. IFAC Proc. Volumes 29(1):2592–2596.Crossref, Google Scholar
Rosenblatt-Roth M (1963) Some theorems concerning the law of large numbers for non-homogeneous Markoff chains. Zeitschrift Wahrscheinlichkeitstheorie Verwandte Gebiete 1(5):433–445.Crossref, Google Scholar
Rosenblatt-Roth M (1964) Some theorems concerning the strong law of large numbers for non-homogeneous Markov chains. Ann. Math. Statist. 35(2):566–576.Crossref, Google Scholar
Rosenthal JS (1995) Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 90(430):558–566.Crossref, Google Scholar
Sart M (2014) Estimation of the transition density of a Markov chain. Annales l’IHP Probabilités Statistiques 50(3):1028–1068.Google Scholar
Shortreed SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA (2011) Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learn. 84(1):109–136.Crossref, Google Scholar
Showkatbakhsh M, Tabuada P, Diggavi S (2016) System identification in the presence of adversarial outputs. 2016 IEEE 55th Conf. Decision Control (IEEE, Piscataway, NJ), 7177–7182.Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Tangirala AK (2018) Principles of System Identification: Theory and Practice (CRC Press, Boca Raton, FL).Crossref, Google Scholar
Tsybakov AB (2009) Introduction to Nonparametric Estimation (Springer, New York).Crossref, Google Scholar
van Eeden C (2006) Minimax estimators and their admissibility. Restricted Parameter Space Estimation Problems: Admissibility and Minimaxity Properties (Springer, New York), 33–67.Crossref, Google Scholar
Vidyasagar M, Karandikar RL (2006) A learning theory approach to system identification and stochastic adaptive control. Probabilistic and Randomized Methods for Design Under Uncertainty (Springer, London), 265–302.Crossref, Google Scholar
Wang S, Si N, Blanchet J, Zhou Z (2023) On the foundation of distributionally robust reinforcement learning. Preprint, submitted November 15, https://arxiv.org/abs/2311.09018.Google Scholar
Wolfer G, Kontorovich A (2021) Statistical estimation of ergodic Markov chain kernel over discrete state space. Bernoulli 27(1):532–553.Crossref, Google Scholar
Yakowitz S (1979) Nonparametric estimation of Markov transition functions. Ann. Statist. 7(3):671–679.Crossref, Google Scholar
Yakowitz S (1989) Nonparametric density and regression estimation for Markov sequences without mixing assumptions. J. Multivariate Anal. 30(1):124–136.Crossref, Google Scholar
Yan Y, Li G, Chen Y, Fan J (2022) Model-based reinforcement learning is minimax-optimal for offline zero-sum Markov games. Preprint, submitted June 8, https://arxiv.org/abs/2206.04044.Google Scholar
Yu C, Liu J, Nemati S, Yin G (2021) Reinforcement learning in healthcare: A survey. ACM Comput. Surveys 55(1):1–36.Crossref, Google Scholar
Yu T, Thomas G, Yu L, Ermon S, Zou JY, Levine S, Finn C, Ma T (2020) MOPO: Model-based offline policy optimization. Adv. Neural Inform. Processing Systems 33:14129–14142.Google Scholar
Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Internat. Conf. Machine Learn. (PMLR, New York), 7304–7312.Google Scholar

Volume 73, Issue 4

July-August 2025

Pages iii-viii, 1723-2295, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:January 29, 2023
Accepted:November 26, 2024
Published Online:February 21, 2025

Cite as

Imon Banerjee, Harsha Honnappa, Vinayak Rao (2025) Off-line Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. Operations Research 73(4):2281-2295.

https://doi.org/10.1287/opre.2023.0046

Keywords

Acknowledgments

I. Banerjee thanks Anamitra Chaudhuri for numerous insightful discussions and comments throughout the duration of this project. The authors thank the anonymous reviewers for their insightful comments and especially for pointing them toward the interesting application detailed in Section 4.5. Work was completed while Imon Banerjee was at the Department of Statistics, Purdue University.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Off-line Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity

References

Volume 73, Issue 4

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News