How Much Can Machines Learn Finance from Chinese Text Data?

Yang Zhou
https://orcid.org/0000-0003-2698-6077
Institute for Big Data, Fudan University, Shanghai 200433, China; MOE Laboratory for National Development and Intelligent Governance, Fudan University, Shanghai 200433, China
Jianqing Fan
Corresponding Author
https://orcid.org/0000-0003-3250-7677
International School of Economics and Management, Capital University of Economics and Business, Beijing 100070, China; Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544; School of Data Science, Fudan University, Shanghai 200433, China
Lirong Xue
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544


Published Online: 18 Mar 2024
https://doi.org/10.1287/mnsc.2022.01468


Volume 70, Issue 12

December 2024

Pages 8217-9119, iv-vi


Information

Received: February 16, 2021
Accepted: July 18, 2023
Published Online: March 18, 2024

Cite as

Yang Zhou, Jianqing Fan, Lirong Xue (2024) How Much Can Machines Learn Finance from Chinese Text Data? Management Science 70(12):8962–8987.

https://doi.org/10.1287/mnsc.2022.01468


Acknowledgments

The authors are grateful for the comments and suggestions of Shuyi Ge, Oliver Linton, Stefan Nagel, Wei Xiong, Dacheng Xiu, and anonymous reviewers, among others. The authors also acknowledge the research assistance of Yuan Gao and Danchun Chen.
