Regurgitative Training: The Value of Real Data in Training Large Language Models

Published Online: https://doi.org/10.1287/mnsc.2024.07005

What happens if we train a new large language model (LLM) using data at least partially generated by other LLMs? The explosive success of LLMs means that content online will increasingly be generated by LLMs rather than humans, which inevitably enters the training data sets of next-generation LLMs. In this paper, we study the implications of such “regurgitative training” on LLM performance. Starting with the machine translation task (a representative language task with well-established evaluation criteria), we fine-tune LLMs with data generated either by themselves or by other LLMs, and we find strong evidence that regurgitative training handicaps the performance of fine-tuned LLMs. A comparison between LLM-generated data and real data reveals suggestive evidence that higher error rates and lower lexical diversity in LLM-generated data may be at play. Accordingly, we propose and evaluate three strategies to mitigate the performance loss by (i) prioritizing high-quality LLM-generated data, (ii) mixing data generated by multiple LLMs, and (iii) prioritizing LLM-generated data that most resemble real data. All three strategies can improve the performance of regurgitative training to some extent but cannot fully close the gap from training with real data. This highlights that real, human-generated data cannot be easily substituted by LLM-generated data in training LLMs. Additionally, we investigate regurgitative training on a creative ideation task with human judgement-based evaluations. Interestingly, we find that preference-based fine-tuning with human feedback on LLM-generated ideas can actually improve ideation performance. This showcases that human preference data when combined with LLM-generated data can bring performance gains.
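To make the lexical-diversity comparison concrete, the following minimal sketch computes a type-token ratio (distinct words divided by total words) for two small corpora. The choice of metric is an assumption for illustration only; the abstract does not say which diversity measure the paper uses. The function name `type_token_ratio` and both toy corpora are hypothetical and are not taken from the paper's data.

```python
import re


def type_token_ratio(texts):
    """Return distinct-word count divided by total word count over a list of strings.

    Assumption: a simple lowercase word tokenizer stands in for whatever
    tokenization a real study would use.
    """
    tokens = [tok for text in texts for tok in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy corpora, invented for illustration (not data from the paper).
real_sentences = [
    "The committee postponed its verdict until spring.",
    "A quiet harbor sheltered three fishing boats.",
]
generated_sentences = [
    "The committee delayed the decision.",
    "The committee delayed the decision again.",
]

real_ttr = type_token_ratio(real_sentences)
generated_ttr = type_token_ratio(generated_sentences)

# A lower ratio means the same words are reused more often.
print(f"real: {real_ttr:.3f}, generated: {generated_ttr:.3f}")
```

The type-token ratio falls as a corpus grows, so a comparison of this kind is only meaningful between samples of similar size.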

This paper was accepted by Hemant Bhargava, information systems.

Funding: This work was supported by the National Natural Science Foundation of China [Grants 72421001 and 72172070] and the Singapore Ministry of Education Academic Research Fund Tier 2 A-8003504 [Robert Brown Promising Researcher Award MOE-T2EP40].

Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2024.07005.
