Textual Factors: A Scalable, Interpretable, and Data-Driven Approach to Analyzing Unstructured Information

Published Online: https://doi.org/10.1287/mnsc.2020.01180

We introduce a general approach for analyzing large-scale text-based data, combining the strengths of neural network language processing and generative statistical modeling to create a factor structure of unstructured data for downstream regressions typically used in social sciences. We generate textual factors by (i) representing texts using vector word embedding, (ii) clustering the vectors using locality-sensitive hashing to generate supports of topics, and (iii) identifying relatively interpretable spanning clusters (i.e., textual factors) through topic modeling. Our data-driven approach captures complex linguistic structures while ensuring computational scalability and economic interpretability, plausibly attaining certain advantages over and complementing other unstructured data analytics used by researchers, including emergent large language models. We conduct initial validation tests of the framework and discuss three types of its applications: (i) enhancing prediction and inference with texts, (ii) interpreting (non–text-based) models, and (iii) constructing new text-based metrics and explanatory variables. We illustrate each of these applications using examples in finance and economics such as macroeconomic forecasting from news articles, interpreting multifactor asset pricing models from corporate filings, and measuring theme-based technology breakthroughs from patents. Finally, we provide a flexible statistical package of textual factors for online distribution to facilitate future research and applications.
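
To make the three steps concrete, the following is a minimal sketch on toy data, not the authors' implementation or their released package. Everything in it is an assumption made for illustration: the six-word vocabulary, the hand-built embeddings, the document-term counts, and the helper names `lsh_clusters` and `textual_factor`. Random-hyperplane hashing stands in for the locality-sensitive hashing step, and a rank-1 singular value decomposition stands in for the topic-modeling step.

```python
import numpy as np

def lsh_clusters(embeddings, n_planes, seed=0):
    """Bucket word vectors by the sign pattern of random hyperplane projections."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_planes))
    bits = (embeddings @ planes) > 0  # shape: (n_words, n_planes)
    buckets = {}
    for word_idx, row in enumerate(bits):
        buckets.setdefault(tuple(row.tolist()), []).append(word_idx)
    return list(buckets.values())

def textual_factor(doc_term, support):
    """Rank-1 SVD restricted to the words in `support`.

    Returns non-negative word weights (summing to 1) and one loading per document.
    This is an illustrative stand-in for the topic-modeling step.
    """
    sub = doc_term[:, support].astype(float)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    weights = np.abs(vt[0])
    weights = weights / weights.sum()
    loadings = sub @ weights
    return weights, loadings

# Step (i): toy embeddings (hypothetical). Words 0-2 point one way, words 3-5 the other.
vocab = ["inflation", "rates", "prices", "patent", "invention", "claims"]
base = np.array([1.0, 0.5, 0.0])
embeddings = np.vstack([base, 0.9 * base, 1.1 * base, -base, -0.9 * base, -1.1 * base])

# Step (ii): hash the word vectors into clusters that serve as topic supports.
clusters = lsh_clusters(embeddings, n_planes=4)

# Step (iii): extract one factor per cluster from toy document-term counts.
doc_term = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 1, 1, 0, 1, 0],
    [0, 0, 0, 4, 2, 1],
    [0, 1, 0, 2, 3, 1],
])
factors = [textual_factor(doc_term, support) for support in clusters]

for support, (weights, loadings) in zip(clusters, factors):
    print([vocab[i] for i in support], np.round(weights, 3), np.round(loadings, 3))
```

The per-document loadings are the quantities that would enter a downstream regression as explanatory variables, while the word weights within each cluster are what make a factor readable.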

This paper was accepted by David Simchi-Levi, finance.

Funding: The authors gratefully acknowledge the financial support from the Ewing Marion Kauffman Foundation, the Becker Friedman Institute of Economics, the Fama-Miller Center for Research in Finance, INQUIRE Europe, the Kenan Institute of Private Enterprise, and the Risk Institute at OSU Fisher College of Business (while L. W. Cong was a fellow at the institute). W. Zhu acknowledges financial support from the Tsinghua University Initiative Scientific Research Program [Grant 2022Z04W02016], the Tsinghua University School of Economics and Management [Research Grant 2022051002], and the National Natural Science Foundation of China [Grant 72442014].

Supplemental Material: The online appendices and data files are available at https://doi.org/10.1287/mnsc.2020.01180.
