Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Mengke Qiao (Corresponding Author)
https://orcid.org/0000-0002-0554-7916
International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230026, China

Ke-Wei Huang (Corresponding Author)
https://orcid.org/0000-0002-9932-6195
Department of Information Systems and Analytics, National University of Singapore, Singapore 117417

Published Online: 26 Mar 2021
https://doi.org/10.1287/isre.2020.0977

Information Systems Research

Volume 32, Issue 2

June 2021

Pages iii-vii, 301-674, C2

Information

Received: December 27, 2018
Accepted: August 10, 2020
Published Online: March 26, 2021

Cite as

Mengke Qiao, Ke-Wei Huang (2021) Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining. Information Systems Research 32(2):462-480.

https://doi.org/10.1287/isre.2020.0977
