Reliable Off-Policy Evaluation for Reinforcement Learning

Published Online: https://doi.org/10.1287/opre.2022.2382
