EnsembleIV: Creating Instrumental Variables from Ensemble Learners for Robust Statistical Inference with ML-Generated Variables
Abstract
Advances in machine learning have made it easier to extract useful information from both structured and unstructured data. Accordingly, empirical researchers often seek to leverage machine learning for statistical inference and hypothesis testing. We study an increasingly popular practice wherein a supervised machine learning model is trained to predict a certain variable of interest, and the predicted values are subsequently used in regression models as independent variables to draw statistical inferences. However, errors in the predictions inevitably manifest as measurement error in the regression models and bias the resulting estimates. In this paper, we design and evaluate a novel approach, termed EnsembleIV, to address this issue. We propose the use of ensemble machine learning techniques to generate the predictions and show that individual learners in the ensemble (after a proposed special-purpose data-driven transformation procedure) can serve as instrumental variables to correct for the measurement error and avoid estimation biases. EnsembleIV’s effectiveness is demonstrated on both synthetic and real data sets for both linear and generalized linear regression models. We also compare EnsembleIV with several alternative bias correction methods and highlight its advantages. Overall, EnsembleIV represents a flexible algorithm that enables empirical researchers to draw robust statistical inferences with independent variables generated via machine learning.
This paper was accepted by D.J. Wu, information systems.
Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2024.08999.
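The measurement-error problem described in the abstract can be illustrated with a small simulation. The sketch below is not the EnsembleIV procedure: it omits the transformation step the abstract mentions and simply assumes two noisy predictions of the same variable whose errors are independent of each other and of the regression error. The function name `simulate`, the noise scale of 0.7, and the true coefficient of 2.0 are all hypothetical choices made for illustration. Under those assumptions, regressing the outcome on one noisy prediction attenuates the slope, while using the other prediction as an instrument recovers it.

```python
import numpy as np


def simulate(n=200_000, beta=2.0, noise_sd=0.7, seed=0):
    """Compare a naive OLS slope with a simple IV slope under classical
    measurement error. All parameter values are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    # True regressor (unobserved in practice) and the outcome it drives.
    x = rng.normal(size=n)
    y = 1.0 + beta * x + rng.normal(size=n)

    # Two noisy "predictions" of x. ASSUMPTION: their errors are independent
    # of each other, of x, and of the regression error. Real ensemble
    # learners need not satisfy this.
    x1 = x + rng.normal(scale=noise_sd, size=n)
    x2 = x + rng.normal(scale=noise_sd, size=n)

    # Naive OLS of y on x1: slope = cov(x1, y) / var(x1).
    # With var(x) = 1 this is attenuated by the factor 1 / (1 + noise_sd**2).
    ols_slope = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

    # IV estimate with x2 as the instrument for x1:
    # slope = cov(x2, y) / cov(x2, x1).
    iv_slope = np.cov(x2, y)[0, 1] / np.cov(x2, x1)[0, 1]

    return ols_slope, iv_slope


if __name__ == "__main__":
    ols_slope, iv_slope = simulate()
    print(f"naive OLS slope: {ols_slope:.3f}")
    print(f"IV slope:        {iv_slope:.3f}")
```

With the default settings, the naive slope should land near 2.0 / (1 + 0.7²), roughly two-thirds of the true coefficient, while the IV slope should be close to 2.0. The correction relies entirely on the independence assumption stated above.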

