Fast Rates for the Regret of Offline Reinforcement Learning

Published Online:https://doi.org/10.1287/moor.2021.0167

References

  • [1] Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Thirty Third Conf. Learn. Theory. Proc. Machine Learn. Res., vol. 125 (PMLR, New York), 67–83.Google Scholar
  • [2] Agarwal A, Jiang N, Kakade SM, Sun W (2020) Reinforcement learning: Theory and algorithms. Accessed January 10, 2021, https://rltheorybook.github.io/.Google Scholar
  • [3] Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.CrossrefGoogle Scholar
  • [4] Audibert JY, Tsybakov AB (2007) Fast learning rates for plug-in classifiers. Ann. Statist. 35(2):608–633.CrossrefGoogle Scholar
  • [5] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.LinkGoogle Scholar
  • [6] Chen J, Jiang N (2019) Information-theoretic considerations in batch reinforcement learning. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 1042–1051.Google Scholar
  • [7] Dai B, Shaw A, Li L, Xiao L, He N, Liu Z, Chen J, Song L (2018) SBEED: Convergent reinforcement learning with nonlinear function approximation. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 1125–1134.Google Scholar
  • [8] Du SS, Kakade SM, Wang R, Yang LF (2020) Is a good representation sufficient for sample efficient reinforcement learning? Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
  • [9] Du SS, Luo Y, Wang R, Zhang H (2019) Provably efficient q-learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 32 (Curran Associates, Inc., Red Hook, NY).Google Scholar
  • [10] Duan Y, Jia Z, Wang M (2020) Minimax-optimal off-policy evaluation with linear function approximation. Proc. 37th Internat. Conf. Machine Learn. Proc. Machine Learn. Res., vol. 119 (PMLR, New York), 2701–2709.Google Scholar
  • [11] Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J. Machine Learn. Res. 6(Apr):503–556.Google Scholar
  • [12] Fan J, Wang Z, Xie Y, Yang Z (2020) A theoretical analysis of deep q-learning. Proc. 2nd Conf. Learn. Dynamics Control. Proc. Machine Learn. Res., vol. 120 (PMLR, New York), 486–489.Google Scholar
  • [13] Gheshlaghi Azar M, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.CrossrefGoogle Scholar
  • [14] Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.LinkGoogle Scholar
  • [15] Hu Y, Kallus N, Mao X (2022) Fast rates for contextual linear optimization. Management Sci. 68(6):4236–4245.LinkGoogle Scholar
  • [16] Hu Y, Kallus N, Mao X (2022) Smooth contextual bandits: Bridging the parametric and nondifferentiable regret regimes. Oper. Res. 70(6):3261–3281.LinkGoogle Scholar
  • [17] Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.Google Scholar
  • [18] Kallus N, Uehara M (2020) Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. J. Machine Learn. Res. 21(167):1–63.Google Scholar
  • [19] Kallus N, Uehara M (2020) Statistically efficient off-policy policy gradients. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5089–5100.Google Scholar
  • [20] Kallus N, Uehara M (2022) Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Oper. Res. 70(6):3282–3302.LinkGoogle Scholar
  • [21] Lagoudakis M, Parr R (2004) Least-squares policy iteration. J. Machine Learn. Res. 4(6):1107–1149.Google Scholar
  • [22] Lazaric A, Ghavamzadeh M, Munos R (2010) Finite-sample analysis of LSTD. Proc. 27th Internat. Conf. Machine Learn. (PMLR, New York), 615–622.Google Scholar
  • [23] Liu Q, Li L, Tang Z, Zhou D (2018) Breaking the curse of horizon: Infinite-horizon off-policy estimation. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 5361–5371.Google Scholar
  • [24] Liu Y, Swaminathan A, Agarwal A, Brunskill E (2020) Provably good batch off-policy reinforcement learning without great exploration. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 1264–1274.Google Scholar
  • [25] Luedtke A, Chambaz A (2020) Performance guarantees for policy learning. Ann. Inst. Henri Poincare Probab. Statist. 56(3):2162–2188.Google Scholar
  • [26] Mammen E, Tsybakov AB (1999) Smooth discrimination analysis. Ann. Statist. 27(6):1808–1829.CrossrefGoogle Scholar
  • [27] Munos R (2003) Error bounds for approximate policy iteration. Proc. 20th Internat. Conf. Machine Learn. (PMLR, New York), 560–567.Google Scholar
  • [28] Munos R (2005) Error bounds for approximate value iteration. Proc. Natl. Conf. Artificial Intelligence, vol. 20 (AAAI Press, Menlo Park, CA).Google Scholar
  • [29] Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.Google Scholar
  • [30] Nachum O, Dai B, Kostrikov I, Chow Y, Li L, Schuurmans D (2019) Algaedice: Policy gradient from arbitrary experience. Preprint, submitted December 4, https://arxiv.org/abs/1912.02074.Google Scholar
  • [31] Perchet V, Rigollet P (2013) The multi-armed bandit problem with covariates. Ann. Statist. 41(2):693–721.CrossrefGoogle Scholar
  • [32] Pollard D (1990) Empirical Processes: Theory and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 2 (Institute of Mathematical Statistics, Hayward, CA).CrossrefGoogle Scholar
  • [33] Precup D, Sutton R, Singh S (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. Conf. Machine Learn. (PMLR, New York), 759–766.Google Scholar
  • [34] Scherrer B (2014) Approximate policy iteration schemes: A comparison. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn. Proc. Machine Learn. Res., vol. 32 (PMLR, New York), 1314–1322.Google Scholar
  • [35] Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPS. 33rd Conf. Neural Inform. Processing Systems (NeurIPS 2019) (Curran Associates, Inc., Red Hook, NY), 1153–1162.Google Scholar
  • [36] Singh SP, Yee RC (1994) An upper bound on the loss from approximate optimal-value functions. Machine Learn. 16(3):227–233.CrossrefGoogle Scholar
  • [37] Thomas P, Brunskill E (2016) Data-efficient off-policy policy evaluation for reinforcement learning. Proc. 33rd Internat. Conf. Machine Learn. (PMLR, New York), 2139–2148.Google Scholar
  • [38] Tropp JA (2015) An introduction to matrix concentration inequalities. Foundations Trends Machine Learn. 8(1–2):1–230.CrossrefGoogle Scholar
  • [39] Tsybakov AB (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.CrossrefGoogle Scholar
  • [40] Uehara M, Kallus N, Lee JD, Sun W (2023) Offline minimax soft-Q-learning under realizability and partial coverage. Preprint, submitted February 5, https://arxiv.org/abs/2302.02392.Google Scholar
  • [41] Wainwright MJ (2019) High-Dimensional Statistics: A Non-asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, 48 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [42] Wang R, Foster D, Kakade SM (2021) What are the statistical limits of offline rl with linear function approximation? Internat. Conf. Learn. Representation (OpenReview.net).Google Scholar
  • [43] Xie T, Jiang N (2020) Q* approximation schemes for batch reinforcement learning: A theoretical comparison. Proc. 36th Conf. Uncertainty Artificial Intelligence (UAI). Proc. Machine Learn. Res., vol. 124 (PMLR, New York).Google Scholar
  • [44] Yang K, Yang L, Du S (2021) Q-learning with logarithmic regret. Internat. Conf. Artificial Intelligence Statist. Proc. Machine Learn. Res. (PMLR, New York), 1576–1584.Google Scholar
  • [45] Yin M, Bai Y, Wang Y (2021) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Internat. Conf. Artificial Intelligence Statist. Proc. Machine Learn. Res., vol. 130 (PMLR, New York), 1567–1575.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.