[1] Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Thirty Third Conf. Learn. Theory. Proc. Machine Learn. Res., vol. 125 (PMLR, New York), 67–83.Google Scholar
[2] Agarwal A, Jiang N, Kakade SM, Sun W (2020) Reinforcement learning: Theory and algorithms. Accessed January 10, 2021, https://rltheorybook.github.io/.Google Scholar
[3] Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.Crossref, Google Scholar
[4] Audibert JY, Tsybakov AB (2007) Fast learning rates for plug-in classifiers. Ann. Statist. 35(2):608–633.Crossref, Google Scholar
[5] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
[6] Chen J, Jiang N (2019) Information-theoretic considerations in batch reinforcement learning. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 1042–1051.Google Scholar
[7] Dai B, Shaw A, Li L, Xiao L, He N, Liu Z, Chen J, Song L (2018) SBEED: Convergent reinforcement learning with nonlinear function approximation. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 1125–1134.Google Scholar
[8] Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
[9] Du SS, Luo Y, Wang R, Zhang H (2019) Provably efficient q-learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 32 (Curran Associates, Inc., Red Hook, NY).Google Scholar
[10] Duan Y, Jia Z, Wang M (2020) Minimax-optimal off-policy evaluation with linear function approximation. Proc. 37th Internat. Conf. Machine Learn. Proc. Machine Learn. Res., vol. 119 (PMLR, New York), 2701–2709.Google Scholar
[11] Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J. Machine Learn. Res. 6(Apr):503–556.Google Scholar
[12] Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep q-learning. Proc. 2nd Conf. Learn. Dynamics Control. Proc. Machine Learn. Res., vol. 120 (PMLR, New York), 486–489.Google Scholar
[13] Gheshlaghi Azar M, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
[14] Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.Link, Google Scholar
[15] Hu Y, Kallus N, Mao X (2022) Fast rates for contextual linear optimization. Management Sci. 68(6):4236–4245.Link, Google Scholar
[16] Hu Y, Kallus N, Mao X (2022) Smooth contextual bandits: Bridging the parametric and nondifferentiable regret regimes. Oper. Res. 70(6):3261–3281.Link, Google Scholar
[17] Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.Google Scholar
[18] Kallus N, Uehara M (2020) Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. J. Machine Learn. Res. 21(167):1–63.Google Scholar
[19] Kallus N, Uehara M (2020) Statistically efficient off-policy policy gradients. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5089–5100.Google Scholar
[20] Kallus N, Uehara M (2022) Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Oper. Res. 70(6):3282–3302.Link, Google Scholar
[21] Lagoudakis M, Parr R (2004) Least-squares policy iteration. J. Machine Learn. Res. 4(6):1107–1149.Google Scholar
[22] Lazaric A, Ghavamzadeh M, Munos R (2010) Finite-sample analysis of LSTD. Proc. 27th Internat. Conf. Machine Learn. (PMLR, New York), 615–622.Google Scholar
[23] Liu Q, Li L, Tang Z, Zhou D (2018) Breaking the curse of horizon: Infinite-horizon off-policy estimation. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 5361–5371.Google Scholar
[24] Liu Y, Swaminathan A, Agarwal A, Brunskill E (2020) Provably good batch off-policy reinforcement learning without great exploration. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 1264–1274.Google Scholar
[25] Luedtke A, Chambaz A (2020) Performance guarantees for policy learning. Ann. Inst. Henri Poincare Probab. Statist. 56(3):2162–2188.Google Scholar
[26] Mammen E, Tsybakov AB (1999) Smooth discrimination analysis. Ann. Statist. 27(6):1808–1829.Crossref, Google Scholar
[27] Munos R (2003) Error bounds for approximate policy iteration. Proc. 20th Internat. Conf. Machine Learn. (PMLR, New York), 560–567.Google Scholar
[28] Munos R (2005) Error bounds for approximate value iteration. Proc. Natl. Conf. Artificial Intelligence, vol. 20 (AAAI Press, Menlo Park, CA).Google Scholar
[29] Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.Google Scholar
[30] Nachum O, Dai B, Kostrikov I, Chow Y, Li L, Schuurmans D (2019) Algaedice: Policy gradient from arbitrary experience. Preprint, submitted December 4, https://arxiv.org/abs/1912.02074.Google Scholar
[31] Perchet V, Rigollet P (2013) The multi-armed bandit problem with covariates. Ann. Statist. 41(2):693–721.Crossref, Google Scholar
[32] Pollard D (1990) Empirical Processes: Theory and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 2 (Institute of Mathematical Statistics, Hayward, CA).Crossref, Google Scholar
[33] Precup D, Sutton R, Singh S (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. Conf. Machine Learn. (PMLR, New York), 759–766.Google Scholar
[34] Scherrer B (2014) Approximate policy iteration schemes: A comparison. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn. Proc. Machine Learn. Res., vol. 32 (PMLR, New York), 1314–1322.Google Scholar
[35] Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPS. 33rd Conf. Neural Inform. Processing Systems (NeurIPS 2019) (Curran Associates, Inc., Red Hook, NY), 1153–1162.Google Scholar
[36] Singh SP, Yee RC (1994) An upper bound on the loss from approximate optimal-value functions. Machine Learn. 16(3):227–233.Crossref, Google Scholar
[37] Thomas P, Brunskill E (2016) Data-efficient off-policy policy evaluation for reinforcement learning. Proc. 33rd Internat. Conf. Machine Learn. (PMLR, New York), 2139–2148.Google Scholar
[38] Tropp JA (2015) An introduction to matrix concentration inequalities. Foundations Trends Machine Learn. 8(1–2):1–230.Crossref, Google Scholar
[39] Tsybakov AB (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.Crossref, Google Scholar
[40] Uehara M, Kallus N, Lee JD, Sun W (2023) Offline minimax soft-Q-learning under realizability and partial coverage. Preprint, submitted February 5, https://arxiv.org/abs/2302.02392.Google Scholar
[41] Wainwright MJ (2019) High-Dimensional Statistics: A Non-asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[42] Wang R, Foster D, Kakade SM (2021) What are the statistical limits of offline rl with linear function approximation? Internat. Conf. Learn. Representation (OpenReview.net).Google Scholar
[43] Xie T, Jiang N (2020) Q* approximation schemes for batch reinforcement learning: A theoretical comparison. Proc. 36th Conf. Uncertainty Artificial Intelligence (UAI). Proc. Machine Learn. Res., vol. 124 (PMLR, New York).Google Scholar
[44] Yang K, Yang L, Du S (2021) Q-learning with logarithmic regret. Internat. Conf. Artificial Intelligence Statist. Proc. Machine Learn. Res. (PMLR, New York), 1576–1584.Google Scholar
[45] Yin M, Bai Y, Wang Y (2021) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Internat. Conf. Artificial Intelligence Statist. Proc. Machine Learn. Res., vol. 130 (PMLR, New York), 1567–1575.Google Scholar

cover image Mathematics of Operations Research

Volume 50, Issue 1

February 2025

Pages 1-781 C2

Article Information

Metrics

Information

Received:July 09, 2021
Accepted:March 05, 2024
Published Online:March 25, 2024

Cite as

Yichun Hu, Nathan Kallus, Masatoshi Uehara (2024) Fast Rates for the Regret of Offline Reinforcement Learning. Mathematics of Operations Research 50(1):633-655.

https://doi.org/10.1287/moor.2021.0167

Keywords

Acknowledgments

The authors thank Yoav Kallus for providing the geometric argument to improve the dependence of the δ0 constant on |A| in the proof of Lemma 1, which is based on a proof by Noam Elkies of the monotonicity of perimeter for convex sets under containment (https://mathoverflow.net/questions/71502/circumference-of-convex-shapes/71505#71505). The authors acknowledge the helpful and constructive comments by the anonymous reviewers from COLT 2021, where an extended abstract of this work was presented.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Fast Rates for the Regret of Offline Reinforcement Learning

References

Volume 50, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News