Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem
Published Online: 21 Sep 2022. https://doi.org/10.1287/ijds.2022.0019