Nonstationary A/B Tests: Optimal Variance Reduction, Bias Correction, and Valid Inference

Yuhang Wu
https://orcid.org/0009-0007-4635-8066
Department of Industrial Engineering and Operations Research, University of California, Berkeley, California 94720

Zeyu Zheng (Corresponding Author)
https://orcid.org/0000-0001-5653-152X
Department of Industrial Engineering and Operations Research, University of California, Berkeley, California 94720

Guangyu Zhang
Amazon.com Inc, Seattle, Washington 98109

Zuohua Zhang
Amazon.com Inc, Seattle, Washington 98109

Chu Wang
Amazon.com Inc, Seattle, Washington 98109

Published Online: 18 Sep 2024
https://doi.org/10.1287/mnsc.2022.01205

References

Abadie A, Diamond A, Hainmueller J (2010) Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. J. Amer. Statist. Assoc. 105(490):493–505.
Abbasi-Yadkori Y, Bartlett P, Gabillon V, Malek A, Valko M (2018) Best of both worlds: Stochastic & adversarial best-arm identification. Lawrence N, ed. Proc. Conf. Learn. Theory (PMLR, Cambridge), 918–949.
Alban A, Chick SE, Zoumpoulis SI (2021) Expected value of information methods for contextual ranking and selection: Clinical trials and simulation optimization. Proc. Winter Simulation Conf. (IEEE, New York), 1–12.
Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, vol. 57 (Springer Science & Business Media, Boston).
Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973.
Bertsimas D, Johnson M, Kallus N (2015) The power of optimization over randomization in designing experiments involving small samples. Oper. Res. 63(4):868–876.
Bhat N, Farias VF, Moallemi CC, Sinha D (2020) Near-optimal A-B testing. Management Sci. 66(10):4477–4495.
Blyth CR (1972) On Simpson’s paradox and the sure-thing principle. J. Amer. Statist. Assoc. 67(338):364–366.
Bojinov I, Simchi-Levi D, Zhao J (2023) Design and analysis of switchback experiments. Management Sci. 69(7):3759–3777.
Cheung WC, Simchi-Levi D, Zhu R (2022) Hedging the drift: Learning to optimize under nonstationarity. Management Sci. 68(3):1696–1713.
Chick SE, Frazier P (2012) Sequential sampling with economics of selection procedures. Management Sci. 58(3):550–569.
Chick SE, Inoue K (2001) New two-stage and sequential procedures for selecting the best simulated system. Oper. Res. 49(5):732–743.
Chick SE, Branke J, Schmidt C (2010) Sequential sampling to myopically maximize the expected value of information. INFORMS J. Comput. 22(1):71–80.
Chick SE, Gans N, Yapar Ö (2022) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. Management Sci. 68(7):4919–4938.
Deng A, Xu Y, Kohavi R, Walker T (2013) Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. Proc. 6th ACM Internat. Conf. Web Search Data Mining (ACM, New York), 123–132.
Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.
Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput. 21(4):599–613.
Glasserman P, Yao DD (1992) Some guidelines and guarantees for common random numbers. Management Sci. 38(6):884–908.
Gupta S, Kohavi R, Tang D, Xu Y, Andersen R, Bakshy E, Cardin N, et al. (2019) Top challenges from the first practical online controlled experiments summit. SIGKDD Explorations 21(1):20–35.
Hahn J (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66(2):315–331.
Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.
Holtz D, Aral S (2020) Limiting bias from test-control interference in online marketplace experiments. Preprint, submitted May 20, https://dx.doi.org/10.2139/ssrn.3583596.
Jamieson K, Talwalkar A (2016) Non-stochastic best arm identification and hyperparameter optimization. Artificial Intelligence and Statistics (PMLR, Cambridge), 240–248.
Johari R, Koomen P, Pekelis L, Walsh D (2017) Peeking at A/B tests: Why it matters, and what to do about it. Proc. 23rd ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 1517–1525.
Johari R, Koomen P, Pekelis L, Walsh D (2022a) Always valid inference: Continuous monitoring of A/B tests. Oper. Res. 70(3):1806–1821.
Johari R, Li H, Liskovich I, Weintraub GY (2022b) Experimental design in two-sided platforms: An analysis of bias. Management Sci. 68(10):7069–7089.
Kato M, Ariu K (2021) The role of contextual information in best arm identification. Preprint, submitted June 26, https://arxiv.org/abs/2106.14077.
Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.
Kohavi R, Longbotham R (2017) Online controlled experiments and A/B testing. Encyclopedia Machine Learn. Data Mining 7(8):922–929.
Kohavi R, Tang D, Xu Y (2020) Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press, Cambridge, UK).
Kohavi R, Deng A, Frasca B, Walker T, Xu Y, Pohlmann N (2013) Online controlled experiments at large scale. Proc. 19th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 1168–1176.
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).
Li X, Ding P (2020) Rerandomization and regression adjustment. J. Roy. Statist. Soc. Ser. B Statist. Methodology 82(1):241–268.
Li W, Chen N, Hong LJ (2019) A dimension-free algorithm for contextual continuum-armed bandits. Preprint, submitted July 15, https://arxiv.org/abs/1907.06550.
Li H, Zhao G, Johari R, Weintraub GY (2021) Interference, bias, and variance in two-sided marketplace experimentation: Guidance for platforms. Preprint, submitted April 25, https://arxiv.org/abs/2104.12222.
Lin W (2013) Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Ann. Appl. Statist. 7(1):295–318.
Miratrix LW, Sekhon JS, Yu B (2013) Adjusting treatment effect estimates by post-stratification in randomized experiments. J. Roy. Statist. Soc. Ser. B Statist. Methodology 75(2):369–396.
Newey WK (1990) Semiparametric efficiency bounds. J. Appl. Econometrics 5(2):99–135.
Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted February 18, https://arxiv.org/abs/2202.09036.
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J. Ed. Psych. 66(5):688.
Rubin DB (1978) Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6(1):34–58.
Russac Y, Katsimerou C, Bohle D, Cappé O, Garivier A, Koolen WM (2021) A/B/n testing with control in the presence of subpopulations. Adv. Neural Inform. Processing Systems 34:25100–25110.
Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.
Scott SL (2010) A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Industry 26(6):639–658.
Shen C (2019) Universal best arm identification. IEEE Trans. Signal Processing 67(17):4464–4478.
Taddy M, Lopes HF, Gardner M (2016) Scalable semiparametric inference for the means of heavy-tailed distributions. Preprint, submitted February 25, https://arxiv.org/abs/1602.08066.
Tang D, Agarwal A, O’Brien D, Meyer M (2010) Overlapping experiment infrastructure: More, better, faster experimentation. Proc. 16th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 17–26.
Ugander J, Karrer B, Backstrom L, Kleinberg J (2013) Graph cluster randomization: Network exposure to multiple universes. Proc. 19th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 329–337.
Xie H, Aurisset J (2016) Improving the sensitivity of online controlled experiments: Case studies at Netflix. Proc. 22nd ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 645–654.
Zhang T, Blanchet J, Glynn PW (2022) Adaptive stratified sampling with infinitely many strata. Working paper, Stanford University, Palo Alto, CA.
Zhao J, Zhou Z (2024) Pigeonhole design: Balancing sequential experiments from an online matching perspective. Management Sci., ePub ahead of print May 24, https://doi.org/10.1287/mnsc.2023.02184.
Zheng Z, Glynn PW (2017) A CLT for infinitely stratified estimators, with applications to debiased MLMC. ESAIM Proc. Surveys 59:104–114.
Zhu R, Kveton B (2021) Safe optimal design with applications in policy learning. Preprint, submitted November 10, https://dx.doi.org/10.2139/ssrn.3959086.

Volume 71, Issue 6, June 2025, Pages iv-vi, 4533-5418

Article Information

Received: April 21, 2022
Accepted: January 06, 2024
Published Online: September 18, 2024

Cite as

Yuhang Wu, Zeyu Zheng, Guangyu Zhang, Zuohua Zhang, Chu Wang (2024) Nonstationary A/B Tests: Optimal Variance Reduction, Bias Correction, and Valid Inference. Management Science 71(6):4707-4727.

https://doi.org/10.1287/mnsc.2022.01205
