A Robust Optimization Approach to Reliable Statistical Inference with Variables Generated by Machine Learning

Aaron Schecter (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-3186-7788
Department of Management Information Systems, University of Georgia, Athens, Georgia 30602

Weifeng Li
[email protected]
https://orcid.org/0000-0002-2105-3596
Department of Management Information Systems, University of Georgia, Athens, Georgia 30602

Published Online: 24 Dec 2025
https://doi.org/10.1287/isre.2023.0340

Abstract

Leveraging supervised machine learning (SML) algorithms to operationalize constructs from unstructured data such as text or images is becoming increasingly common in practice and research. As a result, variables generated through SML are now used in traditional regression models to test hypotheses. However, algorithms are imperfect, and thus, the variables produced by SML have measurement errors relative to the underlying construct, potentially leading to biased coefficients and faulty inference. In this paper, we propose using robust optimization to reduce the negative impact of these errors and enable more accurate hypothesis testing. We leverage robust optimization techniques to fit a linear regression model in the presence of measurement errors of different magnitudes. We theoretically demonstrate the bias, variance, and hypothesis testing performance of the robust approach and propose a correction term to effectively reduce bias. Through experiments on simulated data sets and a case study of Amazon reviews, we demonstrate the effectiveness of our approach and identify conditions in which robust optimization likely outperforms other methods. We make recommendations for researchers leveraging machine learning–generated variables in causal inference.
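
The bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' robust optimization method. It only shows the textbook attenuation effect under an assumed classical additive error model (independent Gaussian noise added to a single regressor), which is a simplification of the errors a real classifier would produce. The function name `simulate_attenuation` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_attenuation(n=20000, beta=2.0, error_sd=1.0, seed=0):
    """Compare OLS slopes fitted on a true regressor and on a noisy proxy.

    Assumption (not from the paper): classical additive measurement error,
    x_obs = x_true + noise, with noise independent of x_true and of y.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.normal(0.0, 1.0, n)                 # underlying construct, variance 1
    y = beta * x_true + rng.normal(0.0, 1.0, n)      # outcome driven by the construct
    x_obs = x_true + rng.normal(0.0, error_sd, n)    # proxy with measurement error

    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope_true = np.polyfit(x_true, y, 1)[0]
    slope_obs = np.polyfit(x_obs, y, 1)[0]

    # Under this error model the slope shrinks by var(x_true) / var(x_obs)
    reliability = 1.0 / (1.0 + error_sd ** 2)
    return slope_true, slope_obs, beta * reliability


if __name__ == "__main__":
    s_true, s_obs, expected = simulate_attenuation()
    print(f"slope on true regressor:  {s_true:.3f}")
    print(f"slope on noisy regressor: {s_obs:.3f}")
    print(f"attenuated slope implied by the error model: {expected:.3f}")
```

With these assumed settings the proxy has twice the variance of the construct, so the slope fitted on the proxy should land near half the true coefficient. A hypothesis test on that coefficient would then be based on an understated effect.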

History: Olivia Liu Sheng, Senior Editor; Huimin Zhao, Associate Editor.

Funding: The authors acknowledge support from the Terry Sanford Award from the University of Georgia.

Supplemental Material: The online appendix is available at https://doi.org/10.1287/isre.2023.0340.

Information Systems Research, Articles in Advance

Information

Received: July 07, 2023
Accepted: November 15, 2025
Published Online: December 24, 2025

Cite as

Aaron Schecter, Weifeng Li (2025) A Robust Optimization Approach to Reliable Statistical Inference with Variables Generated by Machine Learning. Information Systems Research 0(0).

https://doi.org/10.1287/isre.2023.0340

Acknowledgments

A. Schecter thanks the faculty at the University of Notre Dame Department of Information Technology, Analytics, and Operations for their helpful feedback. The authors thank the anonymous reviewers, associate editor, and senior editor for their constructive feedback.
