A Robust Optimization Approach to Reliable Statistical Inference with Variables Generated by Machine Learning

Published Online: https://doi.org/10.1287/isre.2023.0340

Leveraging supervised machine learning (SML) algorithms to operationalize constructs from unstructured data such as text or images is becoming increasingly common in practice and research. As a result, variables generated through SML are now used in traditional regression models to test hypotheses. However, algorithms are imperfect, and thus, the variables produced by SML have measurement errors relative to the underlying construct, potentially leading to biased coefficients and faulty inference. In this paper, we propose using robust optimization to reduce the negative impact of these errors and enable more accurate hypothesis testing. We leverage robust optimization techniques to fit a linear regression model in the presence of measurement errors of different magnitudes. We theoretically demonstrate the bias, variance, and hypothesis testing performance of the robust approach and propose a correction term to effectively reduce bias. Through experiments on simulated data sets and a case study of Amazon reviews, we demonstrate the effectiveness of our approach and identify conditions in which robust optimization likely outperforms other methods. We make recommendations for researchers leveraging machine learning–generated variables in causal inference.
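The bias the abstract refers to can be illustrated with a small simulation. This is only a sketch under a simplifying assumption that is not taken from the paper: the SML-generated variable equals the true construct plus independent, zero-mean Gaussian noise (classical measurement error). Under that assumption, the OLS slope shrinks toward zero by the factor sigma_x^2 / (sigma_x^2 + sigma_u^2). The function names, parameter values, and error model are hypothetical, and the code does not implement the robust optimization estimator or the correction term proposed by the authors.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=50_000, beta=2.0, sigma_x=1.0, sigma_u=1.0, seed=0):
    """Compare OLS slopes using the true construct vs. a noisy proxy.

    Assumption (illustrative only): the proxy is x_true + u, where u is
    independent Gaussian noise with standard deviation sigma_u.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sigma_x) for _ in range(n)]
    # Outcome depends on the true construct, not on the proxy.
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    # Hypothetical stand-in for a variable generated by an imperfect model.
    x_ml = [x + rng.gauss(0.0, sigma_u) for x in x_true]

    slope_true = ols_slope(x_true, y)
    slope_ml = ols_slope(x_ml, y)
    # Large-sample slope under classical measurement error.
    slope_theory = beta * sigma_x**2 / (sigma_x**2 + sigma_u**2)
    return slope_true, slope_ml, slope_theory


if __name__ == "__main__":
    s_true, s_ml, s_theory = simulate_attenuation()
    print(f"slope with true construct: {s_true:.3f}")
    print(f"slope with noisy proxy:    {s_ml:.3f}")
    print(f"theoretical attenuated:    {s_theory:.3f}")
```

With the default settings, the slope estimated from the noisy proxy should land near half of the true coefficient, which is the kind of distortion that motivates correcting for measurement error before testing hypotheses. Errors from real classifiers are often not classical (for example, misclassification of a binary label), so the direction and size of the bias in practice may differ from this sketch.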

History: Olivia Liu Sheng, Senior Editor; Huimin Zhao, Associate Editor.

Funding: The authors acknowledge support from the Terry Sanford Award from the University of Georgia.

Supplemental Material: The online appendix is available at https://doi.org/10.1287/isre.2023.0340.
