Nonstationary A/B Tests: Optimal Variance Reduction, Bias Correction, and Valid Inference

Published Online: https://doi.org/10.1287/mnsc.2022.01205

References

  • Abadie A, Diamond A, Hainmueller J (2010) Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. J. Amer. Statist. Assoc. 105(490):493–505.
  • Abbasi-Yadkori Y, Bartlett P, Gabillon V, Malek A, Valko M (2018) Best of both worlds: Stochastic & adversarial best-arm identification. Lawrence N, ed. Proc. Conf. Learn. Theory (PMLR, Cambridge), 918–949.
  • Alban A, Chick SE, Zoumpoulis SI (2021) Expected value of information methods for contextual ranking and selection: Clinical trials and simulation optimization. Proc. Winter Simulation Conf. (IEEE, New York), 1–12.
  • Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer Science & Business Media, Boston).
  • Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973.
  • Bertsimas D, Johnson M, Kallus N (2015) The power of optimization over randomization in designing experiments involving small samples. Oper. Res. 63(4):868–876.
  • Bhat N, Farias VF, Moallemi CC, Sinha D (2020) Near-optimal A-B testing. Management Sci. 66(10):4477–4495.
  • Blyth CR (1972) On Simpson’s paradox and the sure-thing principle. J. Amer. Statist. Assoc. 67(338):364–366.
  • Bojinov I, Simchi-Levi D, Zhao J (2023) Design and analysis of switchback experiments. Management Sci. 69(7):3759–3777.
  • Cheung WC, Simchi-Levi D, Zhu R (2022) Hedging the drift: Learning to optimize under nonstationarity. Management Sci. 68(3):1696–1713.
  • Chick SE, Frazier P (2012) Sequential sampling with economics of selection procedures. Management Sci. 58(3):550–569.
  • Chick SE, Inoue K (2001) New two-stage and sequential procedures for selecting the best simulated system. Oper. Res. 49(5):732–743.
  • Chick SE, Branke J, Schmidt C (2010) Sequential sampling to myopically maximize the expected value of information. INFORMS J. Comput. 22(1):71–80.
  • Chick SE, Gans N, Yapar Ö (2022) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. Management Sci. 68(7):4919–4938.
  • Deng A, Xu Y, Kohavi R, Walker T (2013) Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. Proc. 6th ACM Internat. Conf. Web Search Data Mining (ACM, New York), 123–132.
  • Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.
  • Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput. 21(4):599–613.
  • Glasserman P, Yao DD (1992) Some guidelines and guarantees for common random numbers. Management Sci. 38(6):884–908.
  • Gupta S, Kohavi R, Tang D, Xu Y, Andersen R, Bakshy E, Cardin N, et al. (2019) Top challenges from the first practical online controlled experiments summit. SIGKDD Explorations 21(1):20–35.
  • Hahn J (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66(2):315–331.
  • Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.
  • Holtz D, Aral S (2020) Limiting bias from test-control interference in online marketplace experiments. Preprint, submitted May 20, https://dx.doi.org/10.2139/ssrn.3583596.
  • Jamieson K, Talwalkar A (2016) Non-stochastic best arm identification and hyperparameter optimization. Artificial Intelligence and Statistics (PMLR, Cambridge), 240–248.
  • Johari R, Koomen P, Pekelis L, Walsh D (2017) Peeking at A/B tests: Why it matters, and what to do about it. Proc. 23rd ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 1517–1525.
  • Johari R, Koomen P, Pekelis L, Walsh D (2022a) Always valid inference: Continuous monitoring of A/B tests. Oper. Res. 70(3):1806–1821.
  • Johari R, Li H, Liskovich I, Weintraub GY (2022b) Experimental design in two-sided platforms: An analysis of bias. Management Sci. 68(10):7069–7089.
  • Kato M, Ariu K (2021) The role of contextual information in best arm identification. Preprint, submitted June 26, https://arxiv.org/abs/2106.14077.
  • Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.
  • Kohavi R, Longbotham R (2017) Online controlled experiments and A/B testing. Encyclopedia Machine Learn. Data Mining 7(8):922–929.
  • Kohavi R, Tang D, Xu Y (2020) Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press, Cambridge, UK).
  • Kohavi R, Deng A, Frasca B, Walker T, Xu Y, Pohlmann N (2013) Online controlled experiments at large scale. Proc. 19th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 1168–1176.
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).
  • Li X, Ding P (2020) Rerandomization and regression adjustment. J. Roy. Statist. Soc. Ser. B Statist. Methodology 82(1):241–268.
  • Li W, Chen N, Hong LJ (2019) A dimension-free algorithm for contextual continuum-armed bandits. Preprint, submitted July 15, https://arxiv.org/abs/1907.06550.
  • Li H, Zhao G, Johari R, Weintraub GY (2021) Interference, bias, and variance in two-sided marketplace experimentation: Guidance for platforms. Preprint, submitted April 25, https://arxiv.org/abs/2104.12222.
  • Lin W (2013) Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Ann. Appl. Statist. 7(1):295–318.
  • Miratrix LW, Sekhon JS, Yu B (2013) Adjusting treatment effect estimates by post-stratification in randomized experiments. J. Roy. Statist. Soc. Ser. B Statist. Methodology 75(2):369–396.
  • Newey WK (1990) Semiparametric efficiency bounds. J. Appl. Econometrics 5(2):99–135.
  • Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted February 18, https://arxiv.org/abs/2202.09036.
  • Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J. Ed. Psych. 66(5):688.
  • Rubin DB (1978) Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6(1):34–58.
  • Russac Y, Katsimerou C, Bohle D, Cappé O, Garivier A, Koolen WM (2021) A/B/n testing with control in the presence of subpopulations. Adv. Neural Inform. Processing Systems 34:25100–25110.
  • Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.
  • Scott SL (2010) A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Industry 26(6):639–658.
  • Shen C (2019) Universal best arm identification. IEEE Trans. Signal Processing 67(17):4464–4478.
  • Taddy M, Lopes HF, Gardner M (2016) Scalable semiparametric inference for the means of heavy-tailed distributions. Preprint, submitted February 25, https://arxiv.org/abs/1602.08066.
  • Tang D, Agarwal A, O’Brien D, Meyer M (2010) Overlapping experiment infrastructure: More, better, faster experimentation. Proc. 16th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 17–26.
  • Ugander J, Karrer B, Backstrom L, Kleinberg J (2013) Graph cluster randomization: Network exposure to multiple universes. Proc. 19th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 329–337.
  • Xie H, Aurisset J (2016) Improving the sensitivity of online controlled experiments: Case studies at Netflix. Proc. 22nd ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 645–654.
  • Zhang T, Blanchet J, Glynn PW (2022) Adaptive stratified sampling with infinitely many strata. Working paper, Stanford University, Palo Alto, CA.
  • Zhao J, Zhou Z (2024) Pigeonhole design: Balancing sequential experiments from an online matching perspective. Management Sci., ePub ahead of print May 24, https://doi.org/10.1287/mnsc.2023.02184.
  • Zheng Z, Glynn PW (2017) A CLT for infinitely stratified estimators, with applications to debiased MLMC. ESAIM Proc. Surveys 59:104–114.
  • Zhu R, Kveton B (2021) Safe optimal design with applications in policy learning. Preprint, submitted November 10, https://dx.doi.org/10.2139/ssrn.3959086.