Post Reinforcement Learning Inference

Vasilis Syrgkanis
[email protected]
Management Science and Engineering, Stanford University, Stanford, California 94305

Ruohan Zhan (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-3426-2784
UCL School of Management, University College London, London E14 5AA, United Kingdom

Published Online: 24 Dec 2025
https://doi.org/10.1287/opre.2024.1019

Abstract

We study estimation and inference using data collected by reinforcement learning (RL) algorithms. These algorithms adaptively experiment by interacting with individual units over multiple stages, updating their strategies based on past outcomes. Our goal is to evaluate a counterfactual policy after data collection and estimate structural parameters, such as dynamic treatment effects, that support credit assignment and quantify the impact of early actions on final outcomes. These parameters can often be defined as solutions to moment equations, motivating moment-based estimation methods developed for static data. In RL settings, however, data are often collected adaptively under nonstationary behavior policies. As a result, standard estimators fail to achieve asymptotic normality due to time-varying variance. We propose a weighted generalized method of moments (GMM) approach that uses adaptive weights to stabilize this variance. We characterize weighting schemes that ensure consistency and asymptotic normality of the weighted GMM estimators, enabling valid hypothesis testing and uniform confidence region construction. Key applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
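The variance-stabilization idea in the abstract can be illustrated with a minimal sketch on a toy problem. Everything below is an assumption made for illustration and is not taken from the paper: a single-stage two-armed bandit stands in for the multi-stage RL setting, the target parameter is the mean outcome of arm 1 defined by the inverse-propensity moment (1{A_t = 1} / e_t)(Y_t - θ), and the adaptive weight h_t = sqrt(e_t) is one simple choice that makes the conditional variance of each weighted moment constant over time. The function names `simulate_adaptive_data` and `weighted_moment_estimate` are hypothetical, and the weighting schemes characterized in the paper may differ.

```python
import numpy as np


def simulate_adaptive_data(T=5000, true_means=(0.0, 0.5), floor=0.1, seed=0):
    """Collect data with a greedy rule that keeps a probability floor (toy example)."""
    rng = np.random.default_rng(seed)
    sums, counts = np.zeros(2), np.zeros(2)
    W = np.zeros(T, dtype=int)   # arm pulled at each round
    Y = np.zeros(T)              # observed outcome
    e1 = np.zeros(T)             # probability that arm 1 was pulled, given the history
    for t in range(T):
        # Running sample means; an arm with no pulls yet defaults to 0.
        means = np.divide(sums, counts, out=np.zeros(2), where=counts > 0)
        # The currently better-looking arm gets probability 1 - floor.
        p1 = 1.0 - floor if means[1] >= means[0] else floor
        a = int(rng.random() < p1)
        y = true_means[a] + rng.normal()
        W[t], Y[t], e1[t] = a, y, p1
        sums[a] += y
        counts[a] += 1
    return W, Y, e1


def weighted_moment_estimate(W, Y, e1, h):
    """Solve sum_t h_t * (1{W_t = 1} / e_t) * (Y_t - theta) = 0 for theta.

    Returns the estimate and a studentized standard error built from the
    weighted moments evaluated at the estimate.
    """
    ipw = (W == 1) / e1                      # inverse-propensity indicator for arm 1
    jac = np.sum(h * ipw)                    # minus the derivative of the weighted moment sum
    theta = np.sum(h * ipw * Y) / jac
    scores = h * ipw * (Y - theta)           # weighted moments at theta
    se = np.sqrt(np.sum(scores ** 2)) / jac
    return theta, se


W, Y, e1 = simulate_adaptive_data()
# Uniform weights: the unweighted moment estimator.
theta_unif, se_unif = weighted_moment_estimate(W, Y, e1, np.ones_like(e1))
# Adaptive weights h_t = sqrt(e_t) (illustrative assumption): the conditional
# variance of h_t * g_t is then the same in every round.
theta_adapt, se_adapt = weighted_moment_estimate(W, Y, e1, np.sqrt(e1))
for name, th, se in [("uniform", theta_unif, se_unif), ("adaptive", theta_adapt, se_adapt)]:
    print(f"{name}: estimate={th:.3f}, 95% CI=({th - 1.96 * se:.3f}, {th + 1.96 * se:.3f})")
```

In this toy setting, the weight depends only on the logged propensity e_t, which is known from the history before round t, so the weighted moments remain a martingale difference sequence. The normal-approximation interval printed for the adaptive weights relies on that property together with the stabilized variance; the sketch does not attempt to reproduce the paper's conditions or its multi-stage estimators.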

Funding: V. Syrgkanis was supported by the National Science Foundation [Award IIS-2337916].

Supplemental Material: All supplemental materials, including the code, data, and files required to reproduce the results, are available at https://doi.org/10.1287/opre.2024.1019.

Volume 74, Issue 2

March-April 2026

Pages v-ix, 573-1152, iii-iv

Received: May 10, 2024
Accepted: October 31, 2025
Published Online: December 24, 2025

Cite as

Vasilis Syrgkanis, Ruohan Zhan (2025) Post Reinforcement Learning Inference. Operations Research 74(2):917-957.

https://doi.org/10.1287/opre.2024.1019

Acknowledgments

The authors thank the area editor, associate editor, and two anonymous reviewers for constructive and insightful comments that improved the paper; Susan Athey, Xiaohong Chen, and other colleagues for valuable discussions and suggestions; and seminar and conference participants, including those at the Markov Decision Process and Reinforcement Learning Workshop at Cambridge, the ESIF Economics and AI+ML Meeting, and the World Congress of the Econometric Society, for comments and feedback.
