Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Mochen Yang (Corresponding Author)
https://orcid.org/0000-0001-5101-9041
Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455

Edward McFowland, III
https://orcid.org/0000-0001-5249-7117
Department of Technology and Operations Management, Harvard Business School, Boston, Massachusetts 02163

Gordon Burtch
https://orcid.org/0000-0001-9798-1113
Department of Information Systems, Questrom School of Business, Boston University, Boston, Massachusetts 02215

Gediminas Adomavicius
https://orcid.org/0000-0001-5251-5098
Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455

Published Online: 21 Sep 2022
https://doi.org/10.1287/ijds.2022.0019

References

Aggarwal CC (2015) Data Mining: The Textbook (Springer, Berlin).
Aggarwal R, Gopal R, Gupta A, Singh H (2012) Putting money where the mouths are: The relation between venture financing and electronic word-of-mouth. Inform. Systems Res. 23(3-part-2):976–992.
Angrist JD, Krueger AB (1995) Split-sample instrumental variables estimates of the return to schooling. J. Bus. Econom. Statist. 13(2):225–235.
Angrist JD, Pischke JS (2008) Mostly Harmless Econometrics: An Empiricist’s Companion (Princeton University Press, Princeton, NJ).
Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias. ProPublica May:23.
Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. Proc. National Acad. Sci. USA 113(27):7353–7360.
Athey S, Imbens GW (2017) The state of applied econometrics: Causality and policy evaluation. J. Econom. Perspective 31(2):3–32.
Belloni A, Chen D, Chernozhukov V, Hansen C (2012) Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6):2369–2429.
Berk R, Brown L, Buja A, Zhang K, Zhao L (2013) Valid post-selection inference. Ann. Statist. 41(2):802–837.
Bernard S, Adam S, Heutte L (2012) Dynamic random forests. Pattern Recognition Lett. 33(12):1580–1586.
Bernard S, Heutte L, Adam S (2010) A study of strength and correlation in random forests. Proc. Internat. Conf. on Intelligent Comput. (Springer, Berlin), 186–191.
Biau G, Scornet E (2016) A random forest guided tour. TEST 25(2):197–227.
Biau G, Devroye L, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J. Machine Learn. Res. 9(9).
Blackburn M, Neumark D (1992) Unobserved ability, efficiency wages, and interindustry wage differentials. Quart. J. Econom. 107(4):1421–1436.
Blaser R, Fryzlewicz P (2016) Random rotation ensembles. J. Machine Learn. Res. 17(1):126–151.
Blundell RW, Powell JL (2004) Endogeneity in semiparametric binary response models. Rev. Econom. Stud. 71(3):655–679.
Breiman L (1996) Bagging predictors. Machine Learn. 24(2):123–140.
Breiman L (2001) Random forests. Machine Learn. 45(1):5–32.
Buolamwini J, Gebru T (2018) Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc. Conf. on Fairness, Accountability and Transparency (Association for Computing Machinery, New York), 77–91.
Buse A (1992) The bias of instrumental variable estimators. Econometrica 60(1):173–180.
Buzas JS, Stefanski LA (1996) Instrumental variable estimation in generalized linear measurement error models. J. Amer. Statist. Assoc. 91(435):999–1006.
Carroll RJ, Stefanski LA (1994) Measurement error, instrumental variables and corrections for attenuation with applications to meta-analyses. Statist. Medicine 13(12):1265–1282.
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK (2017) Double/debiased/Neyman machine learning of treatment effects. Amer. Econom. Rev. 107(5):261–265.
Conley TG, Hansen CB, Rossi PE (2012) Plausibly exogenous. Rev. Econom. Statist. 94(1):260–272.
Cook J, Stefanski L (1994) Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 89(428):1314–1328.
Denisko D, Hoffman MM (2018) Classification and interaction in random forests. Proc. National Acad. Sci. USA 115(8):1690–1692.
Ebbes P, Wedel M, Böckenholt U (2009) Frugal IV alternatives to identify the parameter for an endogenous regressor. J. Appl. Econometrics 24(3):446–468.
Ebbes P, Wedel M, Böckenholt U, Steerneman T (2005) Solving and testing for regressor-error (in)dependence when no instrumental variables are available: With new evidence for the effect of education on income. Quant. Marketing Econom. 3(4):365–392.
Ellis PD (2010) The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results (Cambridge University Press, Cambridge, UK).
Fanaee-T H, Gama J (2014) Event labeling combining ensemble detectors and background knowledge. Progress Artificial Intelligence 2(2-3):113–127.
Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J. Machine Learn. Res. 15(1):3133–3181.
Fong C, Tyler M (2021) Machine learning predictions as regression covariates. Political Anal. 29(4):467–484.
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. Proc. 13th Internat. Conf. Machine Learn. (ACM, New York), 148–156.
Frisch R, Waugh FV (1933) Partial time regressions as compared with individual trends. Econometrica 1(4):387–401.
Gebru T, Krause J, Wang Y, Chen D, Deng J, Aiden EL, Fei-Fei L (2017) Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proc. National Acad. Sci. USA 114(50):13108–13113.
Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Trans. Knowledge Data Engrg. 23(10):1498–1512.
Ghose A, Ipeirotis PG, Li B (2012) Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content. Marketing Sci. 31(3):493–520.
Giot R, Cherrier R (2014) Predicting bikeshare system usage up to one day ahead. Proc. IEEE Sympos. on Comput. Intelligence in Vehicles and Transportation Systems (IEEE, New York), 22–29.
Goh KY, Heng CS, Lin Z (2013) Social media brand community and consumer behavior: Quantifying the relative impact of user- and marketer-generated content. Inform. Systems Res. 24(1):88–107.
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning (MIT Press, Cambridge, MA).
Grace YY (2016) Statistical Analysis with Measurement Error or Misclassification (Springer, Berlin).
Greene WH (2003) Econometric Analysis (Pearson Education India).
Gu B, Konana P, Raghunathan R, Chen HM (2014) Research note: The allure of homophily in social media: Evidence from investor responses on virtual communities. Inform. Systems Res. 25(3):604–617.
Gu B, Konana P, Rajagopalan B, Chen HWM (2007) Competition among virtual communities and user valuation: The case of investing-related communities. Inform. Systems Res. 18(1):68–85.
Gustafson P (2003) Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments (CRC Press, Boca Raton, FL).
Györfi L, Kohler M, Krzyzak A, Walk H (2006) A Distribution-Free Theory of Nonparametric Regression (Springer Science & Business Media, New York).
Hausman JA (1978) Specification tests in econometrics. Econometrica 46(6):1251–1271.
Hausman J (2001) Mismeasured variables in econometric analysis: Problems from the right and problems from the left. J. Econom. Perspective 15(4):57–67.
Hausman JA, Newey WK, Powell JL (1995) Nonlinear errors in variables estimation of some Engel curves. J. Econometrics 65(1):205–233.
Hu Y, Schennach SM (2008) Instrumental variable treatment of nonclassical measurement error models. Econometrica 76(1):195–216.
Jelveh Z, Kogut B, Naidu S (2015) Political language in economics. Working paper.
Küchenhoff H, Lederer W, Lesaffre E (2007) Asymptotic variance estimation for the misclassification SIMEX. Comput. Statist. Data Anal. 51(12):6197–6211.
Küchenhoff H, Mwalili SM, Lesaffre E (2006) A general method for dealing with misclassification in regression: The misclassification SIMEX. Biometrics 62(1):85–96.
Lee JD, Sun DL, Sun Y, Taylor JE (2016) Exact post-selection inference, with application to the lasso. Ann. Statist. 44(3):907–927.
Lewbel A (2019) Using instrumental variables to estimate models with mismeasured regressors. Working paper.
Liu Y, Chen R, Chen Y, Mei Q, Salib S (2012) “I loan because…”: Understanding motivations for pro-social lending. Proc. 5th ACM Internat. Conf. on Web Search and Data Mining, 503–512.
Loken E, Gelman A (2017) Measurement error and the replication crisis. Science 355(6325):584–585.
Lu Y, Jerath K, Singh PV (2013) The emergence of opinion leaders in a networked online community: A dyadic model with time dynamics and a heuristic for fast estimation. Management Sci. 59(8):1783–1799.
Mammen E, Rothe C, Schienle M (2016) Semiparametric estimation with generated covariates. Econometric Theory 32(5):1140–1177.
Mammen E, Rothe C, Schienle M (2012) Nonparametric regression with nonparametrically generated covariates. Ann. Statist. 40(2):1132–1170.
McFowland III E, Somanchi S, Neill DB (2018) Efficient discovery of heterogeneous treatment effects in randomized experiments via anomalous pattern detection. Preprint, submitted March 24, https://arxiv.org/abs/1803.09159.
Meng L, Wu B, Zhan Z (2016) Linear regression with an estimated regressor: Applications to aggregate indicators of economic development. Empirical Econom. 50(2):299–316.
Moreno A, Terwiesch C (2014) Doing business with strangers: Reputation in online service marketplaces. Inform. Systems Res. 25(4):865–886.
Moro S, Cortez P, Rita P (2014) A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62:22–31.
Murphy KM, Topel RH (1985) Estimation and inference in two-step econometric models. J. Bus. Econom. Statist. 20(1):88–97.
Murray MP (2006) Avoiding invalid instruments and coping with weak instruments. J. Econom. Perspective 20(4):111–132.
Nagar AL (1959) The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica 27(4):575–595.
Newey WK (1984) A method of moments interpretation of sequential estimators. Econom. Lett. 14(2-3):201–206.
Oxley L, McAleer M (1993) Econometric issues in macroeconomic models with generated regressors. J. Econom. Surveys 7(1):1–40.
Pagan A (1984) Econometric issues in the analysis of regressions with generated regressors. Internat. Econom. Rev. 25(1):221–247.
Roodman D (2009) A note on the theme of too many instruments. Oxf. Bull. Econom. Statist. 71(1):135–158.
Ryu JY, Kim HU, Lee SY (2018) Deep learning improves prediction of drug–drug and drug–food interactions. Proc. National Acad. Sci. USA 115(18):E4304–E4311.
Schennach SM (2016) Recent advances in the measurement error literature. Annu. Rev. Econom. 8:341–377.
Scornet E, Biau G, Vert JP (2015) Consistency of random forests. Ann. Statist. 43(4):1716–1741.
Seber GA (2009) Multivariate Observations, vol. 252 (John Wiley & Sons, Hoboken, NJ).
Singh PV, Sahoo N, Mukhopadhyay T (2014) How to attract and retain readers in enterprise blogging? Inform. Systems Res. 25(1):35–52.
Sperlich S (2009) A note on non-parametric estimation with predicted variables. Econom. J. 12(2):382–395.
Taylor J, Tibshirani RJ (2015) Statistical learning and selective inference. Proc. National Acad. Sci. USA 112(25):7629–7634.
Tirunillai S, Tellis GJ (2012) Does chatter really matter? Dynamics of user-generated content and stock performance. Marketing Sci. 31(2):198–215.
Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: A survey and results of new tests. Pattern Recognition 44(2):330–349.
Wang T, Kannan KN, Ulmer JR (2013) The association between the disclosure and the realization of information security risk factors. Inform. Systems Res. 24(2):201–218.
Wooldridge JM (2002) Econometric Analysis of Cross Section and Panel Data (MIT Press, Cambridge, MA).
Yang M, Adomavicius G, Burtch G, Ren Y (2018) Mind the gap: Accounting for measurement error and misclassification in variables generated via data mining. Inform. Systems Res. 29(1):4–24.
Zhu H, Kraut R, Kittur A (2012) Effectiveness of shared leadership in online communities. Proc. ACM Conf. on Comput. Supported Cooperative Work (Association for Computing Machinery, New York), 407–416.
Zhu H, Kraut RE, Wang YC, Kittur A (2011) Identifying shared leadership in Wikipedia. Proc. SIGCHI Conf. on Human Factors in Comput. Systems (Association for Computing Machinery, New York), 3431–3434.

INFORMS Journal on Data Science

Volume 1, Issue 2

October-December 2022

Pages 115-195, C2

Received: February 14, 2022
Accepted: June 27, 2022
Published Online: September 21, 2022

Cite as

Mochen Yang, Edward McFowland, III, Gordon Burtch, Gediminas Adomavicius (2022) Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem. INFORMS Journal on Data Science 1(2):138-155.

https://doi.org/10.1287/ijds.2022.0019
