Reliable Off-Policy Evaluation for Reinforcement Learning