Fast Rates for the Regret of Offline Reinforcement Learning
References
- [1] (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Thirty Third Conf. Learn. Theory. Proc. Machine Learn. Res., vol. 125 (PMLR, New York), 67–83.Google Scholar
- [2] (2020) Reinforcement learning: Theory and algorithms. Accessed January 10, 2021, https://rltheorybook.github.io/.Google Scholar
- [3] (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learn. 71(1):89–129.Crossref, Google Scholar
- [4] (2007) Fast learning rates for plug-in classifiers. Ann. Statist. 35(2):608–633.Crossref, Google Scholar
- [5] (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
- [6] (2019) Information-theoretic considerations in batch reinforcement learning. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 1042–1051.Google Scholar
- [7] (2018) SBEED: Convergent reinforcement learning with nonlinear function approximation. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 1125–1134.Google Scholar
- [8] (2020) Is a good representation sufficient for sample efficient reinforcement learning? Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
- [9] (2019) Provably efficient q-learning with function approximation via distribution shift error checking oracle. Adv. Neural Inform. Processing Systems 32 (Curran Associates, Inc., Red Hook, NY).Google Scholar
- [10] (2020) Minimax-optimal off-policy evaluation with linear function approximation. Proc. 37th Internat. Conf. Machine Learn. Proc. Machine Learn. Res., vol. 119 (PMLR, New York), 2701–2709.Google Scholar
- [11] (2005) Tree-based batch mode reinforcement learning. J. Machine Learn. Res. 6(Apr):503–556.Google Scholar
- [12] (2020) A theoretical analysis of deep q-learning. Proc. 2nd Conf. Learn. Dynamics Control. Proc. Machine Learn. Res., vol. 120 (PMLR, New York), 486–489.Google Scholar
- [13] (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
- [14] (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.Link, Google Scholar
- [15] (2022) Fast rates for contextual linear optimization. Management Sci. 68(6):4236–4245.Link, Google Scholar
- [16] (2022) Smooth contextual bandits: Bridging the parametric and nondifferentiable regret regimes. Oper. Res. 70(6):3261–3281.Link, Google Scholar
- [17] (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.Google Scholar
- [18] (2020) Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. J. Machine Learn. Res. 21(167):1–63.Google Scholar
- [19] (2020) Statistically efficient off-policy policy gradients. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5089–5100.Google Scholar
- [20] (2022) Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Oper. Res. 70(6):3282–3302.Link, Google Scholar
- [21] (2004) Least-squares policy iteration. J. Machine Learn. Res. 4(6):1107–1149.Google Scholar
- [22] (2010) Finite-sample analysis of LSTD. Proc. 27th Internat. Conf. Machine Learn. (PMLR, New York), 615–622.Google Scholar
- [23] (2018) Breaking the curse of horizon: Infinite-horizon off-policy estimation. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 5361–5371.Google Scholar
- [24] (2020) Provably good batch off-policy reinforcement learning without great exploration. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 1264–1274.Google Scholar
- [25] (2020) Performance guarantees for policy learning. Ann. Inst. Henri Poincare Probab. Statist. 56(3):2162–2188.Google Scholar
- [26] (1999) Smooth discrimination analysis. Ann. Statist. 27(6):1808–1829.Crossref, Google Scholar
- [27] (2003) Error bounds for approximate policy iteration. Proc. 20th Internat. Conf. Machine Learn. (PMLR, New York), 560–567.Google Scholar
- [28] (2005) Error bounds for approximate value iteration. Proc. Natl. Conf. Artificial Intelligence, vol. 20 (AAAI Press, Menlo Park, CA).Google Scholar
- [29] (2008) Finite-time bounds for fitted value iteration. J. Machine Learn. Res. 9(May):815–857.Google Scholar
- [30] (2019) Algaedice: Policy gradient from arbitrary experience. Preprint, submitted December 4, https://arxiv.org/abs/1912.02074.Google Scholar
- [31] (2013) The multi-armed bandit problem with covariates. Ann. Statist. 41(2):693–721.Crossref, Google Scholar
- [32] (1990) Empirical Processes: Theory and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 2 (Institute of Mathematical Statistics, Hayward, CA).Crossref, Google Scholar
- [33] (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. Conf. Machine Learn. (PMLR, New York), 759–766.Google Scholar
- [34] (2014) Approximate policy iteration schemes: A comparison. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn. Proc. Machine Learn. Res., vol. 32 (PMLR, New York), 1314–1322.Google Scholar
- [35] (2019) Non-asymptotic gap-dependent regret bounds for tabular MDPS. 33rd Conf. Neural Inform. Processing Systems (NeurIPS 2019) (Curran Associates, Inc., Red Hook, NY), 1153–1162.Google Scholar
- [36] (1994) An upper bound on the loss from approximate optimal-value functions. Machine Learn. 16(3):227–233.Crossref, Google Scholar
- [37] (2016) Data-efficient off-policy policy evaluation for reinforcement learning. Proc. 33rd Internat. Conf. Machine Learn. (PMLR, New York), 2139–2148.Google Scholar
- [38] (2015) An introduction to matrix concentration inequalities. Foundations Trends Machine Learn. 8(1–2):1–230.Crossref, Google Scholar
- [39] (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.Crossref, Google Scholar
- [40] (2023) Offline minimax soft-Q-learning under realizability and partial coverage. Preprint, submitted February 5, https://arxiv.org/abs/2302.02392.Google Scholar
- [41] (2019) High-Dimensional Statistics: A Non-asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [42] (2021) What are the statistical limits of offline rl with linear function approximation? Internat. Conf. Learn. Representation (OpenReview.net).Google Scholar
- [43] (2020) Q* approximation schemes for batch reinforcement learning: A theoretical comparison. Proc. 36th Conf. Uncertainty Artificial Intelligence (UAI). Proc. Machine Learn. Res., vol. 124 (PMLR, New York).Google Scholar
- [44] (2021) Q-learning with logarithmic regret. Internat. Conf. Artificial Intelligence Statist. Proc. Machine Learn. Res. (PMLR, New York), 1576–1584.Google Scholar
- [45] (2021) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Internat. Conf. Artificial Intelligence Statist. Proc. Machine Learn. Res., vol. 130 (PMLR, New York), 1567–1575.Google Scholar

