Textual Factors: A Scalable, Interpretable, and Data-Driven Approach to Analyzing Unstructured Information

Lin William Cong
https://orcid.org/0000-0002-2617-2367
SC Johnson College of Business (Johnson), Cornell University, Ithaca, New York 14850; and International School of Finance, Fudan University, Shanghai 200001, China; and Asian Bureau of Financial and Economic Research (ABFER), Singapore 117592; and National Bureau of Economic Research, Cambridge, Massachusetts 02138

Tengyuan Liang
Booth School of Business, University of Chicago, Chicago, Illinois 60637

Xiao Zhang
Compass Lexecon LLC, Chicago, Illinois 60601

Wu Zhu (Corresponding Author)
https://orcid.org/0009-0002-6855-0618
School of Economics and Management, Tsinghua University, Beijing 100084, China

Published Online: 14 Oct 2025
https://doi.org/10.1287/mnsc.2020.01180

References

Acikalin U, Caskurlu T, Hoberg G, Phillips GM (2023) Intellectual property protection lost and competition: An examination using large language models. Working paper, Tuck School of Business, Dartmouth College, Hanover, NH.
Andoni A, Indyk P, Laarhoven T, Razenshteyn I, Schmidt L (2015) Practical and optimal LSH for angular distance. Proc. 29th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1225–1233.
Antweiler W, Frank MZ (2004) Is all that talk just noise? The information content of internet stock message boards. J. Finance 59(3):1259–1294.
Baker SR, Bloom N, Davis SJ (2016) Measuring economic policy uncertainty. Quart. J. Econom. 131(4):1593–1636.
Bellstam G, Bhagat S, Cookson JA (2021) A text-based analysis of corporate innovation. Management Sci. 67(7):4004–4031.
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J. Machine Learn. Res. 3(Feb):1137–1155.
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J. Machine Learn. Res. 3(1):993–1022.
Bodnaruk A, Loughran T, McDonald B (2015) Using 10-K text to gauge financial constraints. J. Financial Quant. Anal. 50(4):623–646.
Brown SV, Tucker JW (2011) Large-sample evidence on firms’ year-over-year MD&A modifications. J. Accounting Res. 49(2):309–346.
Buehlmaier MM, Whited TM (2018) Are financial constraints priced? Evidence from textual analysis. Rev. Financial Stud. 31(7):2693–2728.
Chen Y, Kelly BT, Xiu D (2024a) Expected returns and large language models. Working paper, Booth School of Business, University of Chicago, Chicago.
Chen MA, Wu Q, Yang B (2019) How valuable is fintech innovation? Rev. Financial Stud. 32(5):2062–2106.
Chen J, Tang G, Zhou G, Zhu W (2024b) ChatGPT, stock market predictability and links to the macroeconomy. Working paper, John M. Olin Business School, Washington University in St. Louis, St. Louis.
Cherepanov V, Shi F, Zakolyukina A (2024) Fraud culture. Working paper, Booth School of Business, University of Chicago, Chicago.
Cohen L, Malloy C, Nguyen Q (2020) Lazy prices. J. Finance 75(3):1371–1415.
Cong LW, Liang T, Yang B, Zhang X (2021) Chapter 10: Analyzing textual information at scale. Balachandran K, ed. Information for Efficient Decision Making: Big Data, Blockchain and Relevance (World Scientific Publishing, Singapore), 239–271.
Cong LW, Tang K, Wang J, Zhang Y (2020) AlphaPortfolio: Direct construction through reinforcement learning and interpretable AI. Preprint, submitted April 20, http://dx.doi.org/10.2139/ssrn.3554486.
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. Proc. 20th Annual Sympos. Comput. Geometry (Association for Computing Machinery, New York), 253–262.
Engelberg JE, Parsons CA (2011) The causal impact of media in financial markets. J. Finance 66(1):67–97.
Evans JA, Aceves P (2016) Machine translation: Mining text for social theory. Annu. Rev. Sociol. 42:21–50.
Gentzkow M, Shapiro JM (2010) What drives media slant? Evidence from US daily newspapers. Econometrica 78(1):35–71.
Gentzkow M, Kelly B, Taddy M (2019) Text as data. J. Econom. Literature 57(3):535–574.
Grimmer J, Stewart BM (2013) Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Anal. (Oxford) 21(3):267–297.
Hanley KW, Hoberg G (2010) The information content of IPO prospectuses. Rev. Financial Stud. 23(7):2821–2864.
Hanley KW, Hoberg G (2012) Litigation risk, strategic disclosure and the underpricing of initial public offerings. J. Financial Econom. 103(2):235–254.
Hanley KW, Hoberg G (2019) Dynamic interpretation of emerging risks in the financial sector. Rev. Financial Stud. 32(12):4543–4603.
Hassan TA, Hollander S, Van Lent L, Tahoun A (2019) Firm-level political risk: Measurement and effects. Quart. J. Econom. 134(4):2135–2202.
Hoberg G, Maksimovic V (2015) Redefining financial constraints: A text-based analysis. Rev. Financial Stud. 28(5):1312–1352.
Hoberg G, Manela A (2024) The natural language of finance. Working paper, Marshall School of Business, University of Southern California, Los Angeles.
Hoberg G, Phillips G (2010) Product market synergies and competition in mergers and acquisitions: A text-based analysis. Rev. Financial Stud. 23(10):3773–3811.
Hoberg G, Phillips G (2016) Text-based network industries and endogenous product differentiation. J. Political Econom. 124(5):1423–1465.
Hoberg G, Knoblock C, Phillips G, Pujara J, Qiu Z, Raschid L (2024) Using representation learning and web text to identify competitor networks. Working paper, Tuck School of Business, Dartmouth College, Hanover, NH.
Jegadeesh N, Wu D (2013) Word power: A new approach for content analysis. J. Financial Econom. 110(3):712–729.
Kelly B, Manela A, Moreira A (2021a) Text selection. J. Bus. Econom. Statist. 39(4):859–879.
Kelly B, Papanikolaou D, Seru A, Taddy M (2021b) Measuring technological innovation over the long run. Amer. Econom. Rev. Insights 3(3):303–320.
Kogan L, Papanikolaou D, Schmidt LD, Seegmiller B (2023) Technology and labor displacement: Evidence from linking patents with worker-level data. Working paper, Sloan School of Management, Massachusetts Institute of Technology, Boston.
Leskovec J, Rajaraman A, Ullman JD (2020) Mining of Massive Data Sets (Cambridge University Press, Cambridge, UK).
Li K, Mai F, Shen R, Yan X (2021) Measuring corporate culture using machine learning. Rev. Financial Stud. 34(7):3265–3315.
Lopez-Lira A, Tang Y (2023) Can ChatGPT forecast stock price movements? Return predictability and large language models. Preprint, submitted April 15, https://arxiv.org/abs/2304.07619.
Loughran T, McDonald B (2013) IPO first-day returns, offer price revisions, volatility, and form S-1 language. J. Financial Econom. 109(2):307–326.
Manela A, Moreira A (2017) News implied volatility and disaster concerns. J. Financial Econom. 123(1):137–162.
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. Proc. 27th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 3111–3119.
Schwenkler G, Zheng H (2020a) Competition or contagion? Evidence from cryptocurrency peers. Working paper, Leavey School of Business, Santa Clara University, Santa Clara, CA.
Schwenkler G, Zheng H (2020b) The network of firms implied by the news. Working paper, Leavey School of Business, Santa Clara University, Santa Clara, CA.
Schwenkler G, Zheng H (2024) Why does news coverage predict returns? Evidence from the underlying editor preferences for risky stocks. Working paper, Leavey School of Business, Santa Clara University, Santa Clara, CA.
Sontag D, Roy DM (2011) Complexity of inference in latent Dirichlet allocation. Proc. 25th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1008–1016.
Streltsov A (2025) Generating exposures with large language models: Insights into M&A activity. Working paper, School of Management, University at Buffalo, The State University of New York, Buffalo, NY.
Tetlock PC (2007) Giving content to investor sentiment: The role of media in the stock market. J. Finance 62(3):1139–1168.
Wallach HM, Mimno D, McCallum A (2009) Rethinking LDA: Why priors matter. Proc. 23rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1973–1981.
Xu W, Kotecha MC, McAdams DA (2024) How good is ChatGPT? An exploratory study on ChatGPT’s performance in engineering design tasks and subjective decision-making. Proc. Design Soc., vol. 4 (Cambridge University Press, Cambridge, UK), 2307–2316.
Zhang Y, Li Y, Cui L, Cai D, Liu L, Fu T, Huang X, et al. (2023) Siren’s song in the AI ocean: A survey on hallucination in large language models. Preprint, submitted September 3, https://arxiv.org/abs/2309.01219.

Volume 71, Issue 12

December 2025

Pages vii-x, 9869-10753, iv-vi

Article Information

Received: April 25, 2020
Accepted: April 09, 2025
Published Online: October 14, 2025

Cite as

Lin William Cong, Tengyuan Liang, Xiao Zhang, Wu Zhu (2025) Textual Factors: A Scalable, Interpretable, and Data-Driven Approach to Analyzing Unstructured Information. Management Science 71(12):10727-10739.

https://doi.org/10.1287/mnsc.2020.01180

Acknowledgments

The authors thank Agostino Capponi, Gerard Hoberg, and Gustavo Schwenkler for detailed comments and direction; Chunrong Ai, Kwan Chen, Tony Cookson, Tarek Hassan, Shiyang Huang, Sanya Kohli, Kai Li, Alejandro Lopez-Lira, Nadya Malenko, Alan Moreira, Deniz Okat, Lubos Pastor, Lauren Sutioso, George Tauchen, Baozhong Yang, Weiyi Zhao, and seminar and conference participants at the AEA/CERNA Joint Meeting, Ansatz Capital, Conference on Big Data, Machine Learning and AI in Economics, Baidu Du Xiaoman Financial, DataYes/KDD China AI x FinTech Workshop, Erasmus University (Rotterdam), Financial Intermediation Research Society Annual Conference (Savannah), Global Digital Economy Summit for Small and Medium Enterprises (DES2020), Guanghua International Symposium, University of Hong Kong, Hong Kong University of Science and Technology, INQUIRE Europe Autumn Seminar (Krakow), IIF International Research Conference & Award Summit (Delhi), JD.com JDD (Financial Arm), Kenan Institute Frontiers of Entrepreneurship Conference, Nanyang Technological University, New Technologies in Finance Conference (Columbia Graduate School of Business), 1st NY Fed FinTech Research Conference, Singapore Management University, Tilburg University, the Second Toronto FinTech Conference, and the Zhongnan University of Economics and Law for feedback and suggestions; Michael Fortunato, Fujie Wang, Oliver Xie, and Guanyu Zhou for excellent research and programming assistance; and Shuyan Huang, Chloe Shin, Raj Shukla, Ellis Soodak, Jiashu Sun, Sourabh Velaga, and Connie Xu for research assistance. The contents of this publication are solely the responsibility of the authors.
