Post Reinforcement Learning Inference

Published Online: https://doi.org/10.1287/opre.2024.1019

References

  • Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 127–135.
  • Baird L (1995) Residual algorithms: Reinforcement learning with function approximation. Machine Learn. Proc. (Elsevier, Amsterdam), 30–37.
  • Barsov S, Ul’yanov VV (1987) Estimates of the proximity of Gaussian measures. Doklady Math. 34:462–466.
  • Bhatia R (2010) Modulus of continuity of the matrix absolute value. Indian J. Pure Appl. Math. 41(1):99–111.
  • Bibaut A, Dimakopoulou M, Kallus N, Chambaz A, van der Laan M (2021) Post-contextual-bandit inference. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 28548–28559.
  • Cattaneo MD, Masini RP, Underwood WG (2025) Yurinskii’s coupling for martingales. Annals Statist. 53(5):2179–2203.
  • Chakraborty B, Moodie EE (2013) Semi-parametric estimation of optimal DTRs by modeling contrasts of conditional mean outcomes. Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine (Springer New York, New York), 53–78.
  • Chen M, Beutel A, Covington P, Jain S, Belletti F, Chi EH (2019) Top-K off-policy correction for a REINFORCE recommender system. Proc. 12th ACM Internat. Conf. Web Search Data Mining (Association for Computing Machinery, New York), 456–464.
  • Chernozhukov V, Escanciano JC, Ichimura H, Newey WK, Robins JM (2022) Locally robust semiparametric estimation. Econometrica 90(4):1501–1535.
  • Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (JMLR, Norfolk, MA), 208–214.
  • Daskalakis C, Golowich N (2022) Fast rates for nonparametric online learning: From realizability to learning in games. Proc. 54th Annual ACM SIGACT Sympos. Theory Comput. (Association for Computing Machinery, New York), 846–859.
  • Deshpande Y, Mackey L, Syrgkanis V, Taddy M (2018) Accurate inference for adaptive linear models. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1194–1203.
  • Devroye L, Mehrabian A, Reddad T (2018) The total variation distance between high-dimensional Gaussians with the same mean. Preprint, submitted October 19, https://arxiv.org/abs/1810.08693.
  • Hadad V, Hirshberg DA, Zhan R, Wager S, Athey S (2021) Confidence intervals for policy evaluation in adaptive experiments. Proc. Natl. Acad. Sci. USA 118(15):e2014602118.
  • Hall P, Heyde CC (2014) Martingale Limit Theory and Its Application (Academic Press, New York).
  • Imbens GW (2004) Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econom. Statist. 86(1):4–29.
  • Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. Proc. 19th Internat. Conf. Machine Learn. (Morgan Kaufmann Publishers, Burlington, MA), 267–274.
  • van der Laan MJ, Robins JM (2003) Unified Methods for Censored Longitudinal Data and Causality (Springer, Berlin).
  • Lei H, Nahum-Shani I, Lynch K, Oslin D, Murphy SA (2012) A “smart” design for building individualized treatment sequences. Annual Rev. Clinical Psych. 8:21–48.
  • Lewis G, Syrgkanis V (2020) Double/debiased machine learning for dynamic treatment effects via g-estimation. Preprint, submitted February 17, https://arxiv.org/abs/2002.07285.
  • Lok JJ, DeGruttola V (2012) Impact of time to start treatment following infection with application to initiating HAART in HIV-positive patients. Biometrics 68(3):745–754.
  • Hernán MA, Robins JM (2023) Causal Inference: What If (CRC Press, Boca Raton, FL).
  • Murphy SA (2003) Optimal dynamic treatment regimes. J. Roy. Statist. Soc. Ser. B (Statist. Methodology) 65(2):331–355.
  • Murphy SA (2005) An experimental design for the development of adaptive treatment strategies. Statist. Medicine 24(10):1455–1481.
  • Neu G, Pike-Burke C (2020) A unifying view of optimism in episodic reinforcement learning. Adv. Neural Inform. Processing Systems, vol. 33 (Association for Computing Machinery, New York), 1392–1403.
  • Neyman J (1979) C(α) tests and their use. Sankhyā: Indian J. Statist. Ser. A 41:1–21.
  • Offer-Westort M, Coppock A, Green DP (2021) Adaptive experimental design: Prospects and applications in political science. Amer. J. Political Sci. 65(4):826–844.
  • Precup D (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. (Morgan Kaufmann Publishers Inc., San Francisco, CA), 759–766.
  • Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. Conf. Learn. Theory (PMLR, New York), 1232–1264.
  • Robins J (1986) A new approach to causal inference in mortality studies with a sustained exposure period—Application to control of the healthy worker survivor effect. Math. Modeling 7(9–12):1393–1512.
  • Robins JM (2004) Optimal structural nested models for optimal sequential decisions. Proc. 2nd Seattle Sympos. Biostatist. Analysis Correlated Data (Springer, Berlin), 189–326.
  • Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.
  • Russo D (2016) Simple Bayesian algorithms for best arm identification. Proc. Conf. Learn. Theory (PMLR, New York), 1417–1418.
  • Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, Cambridge, UK).
  • Shi C, Luo S, Le Y, Zhu H, Song R (2024) Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. J. Amer. Statist. Assoc. 119(545):232–245.
  • Sutton RS, Barto AG (1998) Introduction to Reinforcement Learning, vol. 135 (MIT Press, Cambridge, MA).
  • Vansteelandt S, Sjolander A (2016) Revisiting g-estimation of the effect of a time-varying exposure subject to time-varying confounding. Epidemiology Methods 5(1):37–56.
  • Zhan R, Hadad V, Hirshberg DA, Athey S (2021) Off-policy evaluation via adaptive weighting with data from contextual bandits. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 2125–2135.
  • Zhang K, Janson L, Murphy S (2021) Statistical inference with m-estimators on adaptively collected data. Adv. Neural Inform. Processing Systems 34:7460–7471.