Reliable Off-Policy Evaluation for Reinforcement Learning

Jie Wang
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China

Rui Gao (Corresponding Author)
https://orcid.org/0000-0003-0145-8577
Department of Information, Risk and Operations Management, The University of Texas at Austin, Austin, Texas 78705

Hongyuan Zha
School of Data Science, Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen 518172, China

Published Online: 11 Oct 2022
https://doi.org/10.1287/opre.2022.2382

Volume 72, Issue 2

March-April 2024

Pages iii-vi, 425-870, C2-C3

Article Information

Received: January 14, 2021
Accepted: August 21, 2022
Published Online: October 11, 2022

Cite as

Jie Wang, Rui Gao, Hongyuan Zha (2022) Reliable Off-Policy Evaluation for Reinforcement Learning. Operations Research 72(2):699-716.

https://doi.org/10.1287/opre.2022.2382

Acknowledgments

The authors thank the referees and the editorial team for extensive feedback in improving this manuscript.
