Regurgitative Training: The Value of Real Data in Training Large Language Models

Jinghui Zhang
https://orcid.org/0009-0002-2438-8268
School of Economics and Management, Tsinghua University, Beijing 100084, China

Mochen Yang
https://orcid.org/0000-0001-5101-9041
Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455

Dandan Qiao (Corresponding Author)
https://orcid.org/0000-0002-7038-1940
School of Computing, National University of Singapore, Singapore 117418, Singapore

Qiang Wei
https://orcid.org/0000-0002-8397-7129
School of Economics and Management, Tsinghua University, Beijing 100084, China

Published Online: 11 May 2026
https://doi.org/10.1287/mnsc.2024.07005

Abstract

What happens if we train a new large language model (LLM) using data at least partially generated by other LLMs? The explosive success of LLMs means that content online will increasingly be generated by LLMs rather than humans, which inevitably enters the training data sets of next-generation LLMs. In this paper, we study the implications of such “regurgitative training” on LLM performance. Starting with the machine translation task (a representative language task with well-established evaluation criteria), we fine-tune LLMs with data generated either by themselves or by other LLMs, and we find strong evidence that regurgitative training handicaps the performance of fine-tuned LLMs. A comparison between LLM-generated data and real data reveals suggestive evidence that higher error rates and lower lexical diversity in LLM-generated data may be at play. Accordingly, we propose and evaluate three strategies to mitigate the performance loss by (i) prioritizing high-quality LLM-generated data, (ii) mixing data generated by multiple LLMs, and (iii) prioritizing LLM-generated data that most resemble real data. All three strategies can improve the performance of regurgitative training to some extent but cannot fully close the gap from training with real data. This highlights that real, human-generated data cannot be easily substituted by LLM-generated data in training LLMs. Additionally, we investigate regurgitative training on a creative ideation task with human judgement-based evaluations. Interestingly, we find that preference-based fine-tuning with human feedback on LLM-generated ideas can actually improve ideation performance. This showcases that human preference data when combined with LLM-generated data can bring performance gains.
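The three mitigation strategies named in the abstract can be read as data-selection rules applied to a pool of LLM-generated examples. The sketch below is only an illustration of that reading, not the authors' implementation: the `Example` fields, the `quality` and `similarity_to_real` scores, the round-robin mixing rule, and the use of a type-token ratio as a stand-in for lexical diversity are all assumptions introduced here, since the abstract does not specify how these quantities are measured.

```python
from collections import defaultdict
from typing import NamedTuple


class Example(NamedTuple):
    """One LLM-generated training example (all fields are hypothetical)."""
    source_model: str          # label of the LLM that produced the text
    text: str
    quality: float             # assumed quality score, higher is better
    similarity_to_real: float  # assumed similarity to real data, higher is closer


def type_token_ratio(texts):
    """Distinct tokens / total tokens, a simple proxy for lexical diversity."""
    tokens = [tok.lower() for t in texts for tok in t.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def top_k_by(examples, key, k):
    """Return the k examples with the largest key value."""
    return sorted(examples, key=key, reverse=True)[:k]


def select_high_quality(examples, k):
    """Strategy (i): prioritize high-quality LLM-generated data."""
    return top_k_by(examples, lambda e: e.quality, k)


def mix_models(examples, k):
    """Strategy (ii): mix data from multiple LLMs (round-robin by source)."""
    by_model = defaultdict(list)
    for e in examples:
        by_model[e.source_model].append(e)
    pools = [list(v) for v in by_model.values()]
    mixed = []
    while len(mixed) < k and any(pools):
        for pool in pools:
            if pool and len(mixed) < k:
                mixed.append(pool.pop(0))
    return mixed


def select_most_realistic(examples, k):
    """Strategy (iii): prioritize data that most resemble real data."""
    return top_k_by(examples, lambda e: e.similarity_to_real, k)


# Toy pool with made-up scores, purely for illustration.
examples = [
    Example("model_a", "the cat sat", 0.9, 0.2),
    Example("model_a", "the cat sat down", 0.5, 0.8),
    Example("model_b", "a feline rested", 0.7, 0.6),
    Example("model_b", "a cat sat", 0.3, 0.4),
]

high_quality = select_high_quality(examples, 2)
mixed = mix_models(examples, 3)
realistic = select_most_realistic(examples, 2)
diversity = type_token_ratio([e.text for e in examples])
```

Each function returns a subset that would then be passed to fine-tuning. In practice the scores would have to come from an evaluation metric or a classifier, which this sketch leaves open.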

This paper was accepted by Hemant Bhargava, information systems.

Funding: This work was supported by the National Natural Science Foundation of China [Grants 72421001 and 72172070] and the Singapore Ministry of Education Academic Research Fund Tier 2 A-8003504 [Robert Brown Promising Researcher Award MOE-T2EP40].

Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2024.07005.

Information

Received: July 25, 2024
Accepted: November 25, 2025
Published Online: May 11, 2026

Cite as

Jinghui Zhang, Mochen Yang, Dandan Qiao, Qiang Wei (2026) Regurgitative Training: The Value of Real Data in Training Large Language Models. Management Science 0(0).

https://doi.org/10.1287/mnsc.2024.07005
