Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Published Online: https://doi.org/10.1287/ijds.2022.0019

References

  • Aggarwal CC (2015) Data Mining: The Textbook (Springer, Berlin).
  • Aggarwal R, Gopal R, Gupta A, Singh H (2012) Putting money where the mouths are: The relation between venture financing and electronic word-of-mouth. Inform. Systems Res. 23(3-part-2):976–992.
  • Angrist JD, Krueger AB (1995) Split-sample instrumental variables estimates of the return to schooling. J. Bus. Econom. Statist. 13(2):225–235.
  • Angrist JD, Pischke JS (2008) Mostly Harmless Econometrics: An Empiricist’s Companion (Princeton University Press, Princeton, NJ).
  • Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias. ProPublica May:23.
  • Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. Proc. National Acad. Sci. USA 113(27):7353–7360.
  • Athey S, Imbens GW (2017) The state of applied econometrics: Causality and policy evaluation. J. Econom. Perspective 31(2):3–32.
  • Belloni A, Chen D, Chernozhukov V, Hansen C (2012) Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6):2369–2429.
  • Berk R, Brown L, Buja A, Zhang K, Zhao L (2013) Valid post-selection inference. Ann. Statist. 41(2):802–837.
  • Bernard S, Adam S, Heutte L (2012) Dynamic random forests. Pattern Recognition Lett. 33(12):1580–1586.
  • Bernard S, Heutte L, Adam S (2010) A study of strength and correlation in random forests. Proc. Internat. Conf. on Intelligent Comput. (Springer, Berlin), 186–191.
  • Biau G, Scornet E (2016) A random forest guided tour. TEST 25(2):197–227.
  • Biau G, Devroye L, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J. Machine Learn. Res. 9(9).
  • Blackburn M, Neumark D (1992) Unobserved ability, efficiency wages, and interindustry wage differentials. Quart. J. Econom. 107(4):1421–1436.
  • Blaser R, Fryzlewicz P (2016) Random rotation ensembles. J. Machine Learn. Res. 17(1):126–151.
  • Blundell RW, Powell JL (2004) Endogeneity in semiparametric binary response models. Rev. Econom. Stud. 71(3):655–679.
  • Breiman L (1996) Bagging predictors. Machine Learn. 24(2):123–140.
  • Breiman L (2001) Random forests. Machine Learn. 45(1):5–32.
  • Buolamwini J, Gebru T (2018) Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc. Conf. on Fairness, Accountability and Transparency (Association for Computing Machinery, New York), 77–91.
  • Buse A (1992) The bias of instrumental variable estimators. Econometrica 60(1):173–180.
  • Buzas JS, Stefanski LA (1996) Instrumental variable estimation in generalized linear measurement error models. J. Amer. Statist. Assoc. 91(435):999–1006.
  • Carroll RJ, Stefanski LA (1994) Measurement error, instrumental variables and corrections for attenuation with applications to meta-analyses. Statist. Medicine 13(12):1265–1282.
  • Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK (2017) Double/debiased/Neyman machine learning of treatment effects. Amer. Econom. Rev. 107(5):261–265.
  • Conley TG, Hansen CB, Rossi PE (2012) Plausibly exogenous. Rev. Econom. Statist. 94(1):260–272.
  • Cook J, Stefanski L (1994) Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 89(428):1314–1328.
  • Denisko D, Hoffman MM (2018) Classification and interaction in random forests. Proc. National Acad. Sci. USA 115(8):1690–1692.
  • Ebbes P, Wedel M, Böckenholt U (2009) Frugal IV alternatives to identify the parameter for an endogenous regressor. J. Appl. Econometrics 24(3):446–468.
  • Ebbes P, Wedel M, Böckenholt U, Steerneman T (2005) Solving and testing for regressor-error (in)dependence when no instrumental variables are available: With new evidence for the effect of education on income. Quant. Marketing Econom. 3(4):365–392.
  • Ellis PD (2010) The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results (Cambridge University Press, Cambridge, UK).
  • Fanaee-T H, Gama J (2014) Event labeling combining ensemble detectors and background knowledge. Progress Artificial Intelligence 2(2-3):113–127.
  • Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J. Machine Learn. Res. 15(1):3133–3181.
  • Fong C, Tyler M (2021) Machine learning predictions as regression covariates. Political Anal. 29(4):467–484.
  • Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. Proc. 13th Internat. Conf. Machine Learn. (ACM, New York), 148–156.
  • Frisch R, Waugh FV (1933) Partial time regressions as compared with individual trends. Econometrica 1(4):387–401.
  • Gebru T, Krause J, Wang Y, Chen D, Deng J, Aiden EL, Fei-Fei L (2017) Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proc. National Acad. Sci. USA 114(50):13108–13113.
  • Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Trans. Knowledge Data Engrg. 23(10):1498–1512.
  • Ghose A, Ipeirotis PG, Li B (2012) Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content. Marketing Sci. 31(3):493–520.
  • Giot R, Cherrier R (2014) Predicting bikeshare system usage up to one day ahead. Proc. IEEE Sympos. on Comput. Intelligence in Vehicles and Transportation Systems (IEEE, New York), 22–29.
  • Goh KY, Heng CS, Lin Z (2013) Social media brand community and consumer behavior: Quantifying the relative impact of user- and marketer-generated content. Inform. Systems Res. 24(1):88–107.
  • Goodfellow I, Bengio Y, Courville A (2016) Deep Learning (MIT Press, Cambridge, MA).
  • Grace YY (2016) Statistical Analysis with Measurement Error or Misclassification (Springer, Berlin).
  • Greene WH (2003) Econometric Analysis (Pearson Education India).
  • Gu B, Konana P, Raghunathan R, Chen HM (2014) Research note: The allure of homophily in social media: Evidence from investor responses on virtual communities. Inform. Systems Res. 25(3):604–617.
  • Gu B, Konana P, Rajagopalan B, Chen HWM (2007) Competition among virtual communities and user valuation: The case of investing-related communities. Inform. Systems Res. 18(1):68–85.
  • Gustafson P (2003) Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments (CRC Press, Boca Raton, FL).
  • Györfi L, Kohler M, Krzyzak A, Walk H (2006) A Distribution-Free Theory of Nonparametric Regression (Springer Science & Business Media, New York).
  • Hausman JA (1978) Specification tests in econometrics. Econometrica 46(6):1251–1271.
  • Hausman J (2001) Mismeasured variables in econometric analysis: Problems from the right and problems from the left. J. Econom. Perspective 15(4):57–67.
  • Hausman JA, Newey WK, Powell JL (1995) Nonlinear errors in variables estimation of some Engel curves. J. Econometrics 65(1):205–233.
  • Hu Y, Schennach SM (2008) Instrumental variable treatment of nonclassical measurement error models. Econometrica 76(1):195–216.
  • Jelveh Z, Kogut B, Naidu S (2015) Political language in economics. Working paper.
  • Küchenhoff H, Lederer W, Lesaffre E (2007) Asymptotic variance estimation for the misclassification SIMEX. Comput. Statist. Data Anal. 51(12):6197–6211.
  • Küchenhoff H, Mwalili SM, Lesaffre E (2006) A general method for dealing with misclassification in regression: The misclassification SIMEX. Biometrics 62(1):85–96.
  • Lee JD, Sun DL, Sun Y, Taylor JE (2016) Exact post-selection inference, with application to the lasso. Ann. Statist. 44(3):907–927.
  • Lewbel A (2019) Using instrumental variables to estimate models with mismeasured regressors. Working paper.
  • Liu Y, Chen R, Chen Y, Mei Q, Salib S (2012) “I loan because…”: Understanding motivations for pro-social lending. Proc. 5th ACM Internat. Conf. on Web Search and Data Mining, 503–512.
  • Loken E, Gelman A (2017) Measurement error and the replication crisis. Science 355(6325):584–585.
  • Lu Y, Jerath K, Singh PV (2013) The emergence of opinion leaders in a networked online community: A dyadic model with time dynamics and a heuristic for fast estimation. Management Sci. 59(8):1783–1799.
  • Mammen E, Rothe C, Schienle M (2016) Semiparametric estimation with generated covariates. Econometric Theory 32(5):1140–1177.
  • Mammen E, Rothe C, Schienle M (2012) Nonparametric regression with nonparametrically generated covariates. Ann. Statist. 40(2):1132–1170.
  • McFowland III E, Somanchi S, Neill DB (2018) Efficient discovery of heterogeneous treatment effects in randomized experiments via anomalous pattern detection. Preprint, submitted March 24, https://arxiv.org/abs/1803.09159.
  • Meng L, Wu B, Zhan Z (2016) Linear regression with an estimated regressor: Applications to aggregate indicators of economic development. Empirical Econom. 50(2):299–316.
  • Moreno A, Terwiesch C (2014) Doing business with strangers: Reputation in online service marketplaces. Inform. Systems Res. 25(4):865–886.
  • Moro S, Cortez P, Rita P (2014) A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62:22–31.
  • Murphy KM, Topel RH (1985) Estimation and inference in two-step econometric models. J. Bus. Econom. Statist. 20(1):88–97.
  • Murray MP (2006) Avoiding invalid instruments and coping with weak instruments. J. Econom. Perspective 20(4):111–132.
  • Nagar AL (1959) The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica 27(4):575–595.
  • Newey WK (1984) A method of moments interpretation of sequential estimators. Econom. Lett. 14(2-3):201–206.
  • Oxley L, McAleer M (1993) Econometric issues in macroeconomic models with generated regressors. J. Econom. Surveys 7(1):1–40.
  • Pagan A (1984) Econometric issues in the analysis of regressions with generated regressors. Internat. Econom. Rev. 25(1):221–247.
  • Roodman D (2009) A note on the theme of too many instruments. Oxf. Bull. Econom. Statist. 71(1):135–158.
  • Ryu JY, Kim HU, Lee SY (2018) Deep learning improves prediction of drug–drug and drug–food interactions. Proc. National Acad. Sci. USA 115(18):E4304–E4311.
  • Schennach SM (2016) Recent advances in the measurement error literature. Annu. Rev. Econom. 8:341–377.
  • Scornet E, Biau G, Vert JP (2015) Consistency of random forests. Ann. Statist. 43(4):1716–1741.
  • Seber GA (2009) Multivariate Observations, vol. 252 (John Wiley & Sons, Hoboken, NJ).
  • Singh PV, Sahoo N, Mukhopadhyay T (2014) How to attract and retain readers in enterprise blogging? Inform. Systems Res. 25(1):35–52.
  • Sperlich S (2009) A note on non-parametric estimation with predicted variables. Econom. J. 12(2):382–395.
  • Taylor J, Tibshirani RJ (2015) Statistical learning and selective inference. Proc. National Acad. Sci. USA 112(25):7629–7634.
  • Tirunillai S, Tellis GJ (2012) Does chatter really matter? Dynamics of user-generated content and stock performance. Marketing Sci. 31(2):198–215.
  • Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: A survey and results of new tests. Pattern Recognition 44(2):330–349.
  • Wang T, Kannan KN, Ulmer JR (2013) The association between the disclosure and the realization of information security risk factors. Inform. Systems Res. 24(2):201–218.
  • Wooldridge JM (2002) Econometric Analysis of Cross Section and Panel Data (MIT Press, Cambridge, MA).
  • Yang M, Adomavicius G, Burtch G, Ren Y (2018) Mind the gap: Accounting for measurement error and misclassification in variables generated via data mining. Inform. Systems Res. 29(1):4–24.
  • Zhu H, Kraut R, Kittur A (2012) Effectiveness of shared leadership in online communities. Proc. ACM Conf. on Comput. Supported Cooperative Work (Association for Computing Machinery, New York), 407–416.
  • Zhu H, Kraut RE, Wang YC, Kittur A (2011) Identifying shared leadership in Wikipedia. Proc. SIGCHI Conf. on Human Factors in Comput. Systems (Association for Computing Machinery, New York), 3431–3434.