An Investigation of p-Hacking in E-Commerce A/B Testing

Alex P. Miller (Corresponding Author)
[email protected]
https://orcid.org/0000-0003-3535-2578
Marshall School of Business, University of Southern California, Los Angeles, California 90089

Kartik Hosanagar
[email protected]
https://orcid.org/0000-0002-6442-9434
The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104

Published Online: 30 Jan 2025
https://doi.org/10.1287/isre.2024.0872


Information Systems Research

Volume 36, Issue 3

September 2025

Pages iv-xii, 1269-1947, iii

Received: February 01, 2024
Accepted: September 22, 2024
Published Online: January 30, 2025

Cite as

Alex P. Miller, Kartik Hosanagar (2025) An Investigation of p-Hacking in E-Commerce A/B Testing. Information Systems Research 36(3):1691-1717.

https://doi.org/10.1287/isre.2024.0872

Acknowledgments

The authors thank the editors and anonymous review team. The authors also thank participants and discussants at the Conference on Digital Experimentation, the Workshop on Information Systems and Economics, and the Conference on Information Systems and Technologies.
