
Articles In Advance

Information

Received: January 06, 2025
Accepted: December 30, 2025
Published Online: June 19, 2026

Cite as

Gordon Burtch, Edward McFowland III, Mochen Yang, Gediminas Adomavicius (2026) EnsembleIV: Creating Instrumental Variables from Ensemble Learners for Robust Statistical Inference with ML-Generated Variables. Management Science 0(0).

https://doi.org/10.1287/mnsc.2024.08999

Acknowledgments

The authors thank Davide Viviano for valuable dialogue and feedback on the development of our theoretical results.

EnsembleIV: Creating Instrumental Variables from Ensemble Learners for Robust Statistical Inference with ML-Generated Variables
