An Investigation of p-Hacking in E-Commerce A/B Testing

