Regurgitative Training: The Value of Real Data in Training Large Language Models
Published Online: 11 May 2026. https://doi.org/10.1287/mnsc.2024.07005

