Post Reinforcement Learning Inference

Vasilis Syrgkanis
[email protected]
Management Science and Engineering, Stanford University, Stanford, California 94305

Ruohan Zhan (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-3426-2784
UCL School of Management, University College London, London E14 5AA, United Kingdom

Published Online: 24 Dec 2025
https://doi.org/10.1287/opre.2024.1019

Volume 74, Issue 2

March-April 2026

Pages v-ix, 573-1152, iii-iv

Article Information

Received: May 10, 2024
Accepted: October 31, 2025
Published Online: December 24, 2025

Cite as

Vasilis Syrgkanis, Ruohan Zhan (2025) Post Reinforcement Learning Inference. Operations Research 74(2):917-957.

https://doi.org/10.1287/opre.2024.1019

Acknowledgments

The authors thank the area editor, associate editor, and two anonymous reviewers for constructive and insightful comments that improved the paper; Susan Athey, Xiaohong Chen, and other colleagues for valuable discussions and suggestions; and seminar and conference participants, including those at the Markov Decision Process and Reinforcement Learning Workshop at Cambridge, the ESIF Economics and AI+ML Meeting, and the World Congress of the Econometric Society, for comments and feedback.
