Post Reinforcement Learning Inference

