Regurgitative Training: The Value of Real Data in Training Large Language Models

Jinghui Zhang
https://orcid.org/0009-0002-2438-8268
School of Economics and Management, Tsinghua University, Beijing 100084, China

Mochen Yang
https://orcid.org/0000-0001-5101-9041
Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455

Dandan Qiao (Corresponding Author)
https://orcid.org/0000-0002-7038-1940
School of Computing, National University of Singapore, Singapore 117418, Singapore

Qiang Wei
https://orcid.org/0000-0002-8397-7129
School of Economics and Management, Tsinghua University, Beijing 100084, China

Published Online: 11 May 2026
https://doi.org/10.1287/mnsc.2024.07005

Received: July 25, 2024
Accepted: November 25, 2025
Published Online: May 11, 2026

Cite as

Jinghui Zhang, Mochen Yang, Dandan Qiao, Qiang Wei (2026) Regurgitative Training: The Value of Real Data in Training Large Language Models. Management Science 0(0).

https://doi.org/10.1287/mnsc.2024.07005
