Regurgitative Training: The Value of Real Data in Training Large Language Models

Published Online: https://doi.org/10.1287/mnsc.2024.07005

References

  • Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, et al. (2023) GPT-4 technical report. Preprint, submitted March 15, https://arxiv.org/abs/2303.08774.
  • Alemohammad S, Casco-Rodriguez J, Luzi L, Humayun AI, Babaei H, LeJeune D, Siahkoohi A, Baraniuk RG (2023) Self-consuming generative models go mad. Preprint, submitted July 4, https://arxiv.org/abs/2307.01850.
  • Anderson BR, Shah JH, Kreminski M (2024) Homogenization effects of large language models on human creative ideation. C&C ’24 Proc. 16th Conf. Creativity Cognition, 413–425.
  • Anthropic (2024) The Claude 3 model family: Opus, Sonnet, Haiku. Working paper, Anthropic PBC, San Francisco, CA.
  • Bertrand Q, Bose AJ, Duplessis A, Jiralerspong M, Gidel G (2023) On the stability of iterative retraining of generative models on their own data. Preprint, submitted September 30, https://arxiv.org/abs/2310.00429.
  • Biever C (2023) ChatGPT broke the Turing test—The race is on for new ways to assess AI. Nature 619(7971):686–689.
  • Bran AM, Cox S, Schilter O, Baldassari C, White AD, Schwaller P (2023) Chemcrow: Augmenting large-language models with chemistry tools. Preprint, submitted April 11, https://arxiv.org/abs/2304.05376.
  • Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, et al. (2020) Language models are few-shot learners. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 1877–1901.
  • Chen Z, Chan J (2023) Large language model in creative work: The role of collaboration modality and user expertise. Preprint, submitted September 27, http://dx.doi.org/10.1287/mnsc.2023.03014.
  • Chen J, Mueller J (2023) Quantifying uncertainty in answers from any language model via intrinsic and extrinsic confidence assessment. Preprint, submitted August 30, https://arxiv.org/abs/2308.16175.
  • Chen Z, Deng Y, Yuan H, Ji K, Gu Q (2024) Self-play fine-tuning converts weak language models to strong language models. Preprint, submitted January 2, https://arxiv.org/abs/2401.01335.
  • Chen M, Tworek J, Jun H, Yuan Q, Pinto HPdO, Kaplan J, Edwards H, et al. (2021) Evaluating large language models trained on code. Preprint, submitted July 7, https://arxiv.org/abs/2107.03374.
  • Chowdhury S, Baili N, Vannah B (2021) Ensemble fine-tuned mBERT for translation quality estimation. Preprint, submitted September 8, https://arxiv.org/abs/2109.03914.
  • Cubuk ED, Zoph B, Shlens J, Le QV (2020) Randaugment: Practical automated data augmentation with a reduced search space. Boult T, Medioni G, Zabih R, eds. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition Workshops (IEEE Computer Society, Los Alamitos, CA), 702–703.
  • Doshi AR, Hauser O (2023) Generative artificial intelligence enhances creativity. Preprint, submitted August 14, http://dx.doi.org/10.1126/sciadv.adn5290.
  • Feng Y, Dohmatob E, Yang P, Charton F, Kempe J (2024a) A tale of tails: Model collapse as a change of scaling laws. Kim B, Yue Y, Chaudhuri S, Fragkiadaki K, Khan ME, Sun Y, eds. ICLR 2024 Workshop Navigating Addressing Data Problems Foundation Models (OpenReview, Amherst, MA).
  • Feng Y, Dohmatob E, Yang P, Charton F, Kempe J (2024b) Beyond model collapse: Scaling up with synthesized data requires reinforcement. Salakhutdinov R, Heller K, Kolter Z, Oliver N, Weller A, eds. ICML 2024 Workshop Theoret. Foundations Foundation Models (PMLR, New York), 12942–12968.
  • Gerstgrasser M, Schaeffer R, Dey A, Rafailov R, Sleight H, Hughes J, Korbak T, et al. (2024) Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. Preprint, submitted April 1, https://arxiv.org/abs/2404.01413.
  • Ghaffary S (2024) Microsoft, Google and Meta bet on fake data to build AI models. Bloomberg (May 2), https://www.bloomberg.com/news/newsletters/2024-05-02/microsoft-google-and-meta-bet-on-fake-data-to-train-ai-models.
  • Gong Z, Zhong P, Hu W (2019) Diversity in machine learning. IEEE Access 7:64323–64350.
  • Huang N, Burtch G, Gu B, Hong Y, Liang C, Wang K, Fu D, Yang B (2019) Motivating user-generated content with performance feedback: Evidence from randomized field experiments. Management Sci. 65(1):327–345.
  • Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. Hutchins J, ed. Proc. Machine Translation Summit X Papers (International Association for Machine Translation, Washington, DC), 79–86.
  • Kornish LJ, Ulrich KT (2014) The importance of the raw idea in innovation: Testing the sow’s ear hypothesis. J. Marketing Res. 51(1):14–26.
  • Kotelanski M, Gallo R, Nayak A, Savage T (2023) Methods to estimate large language model confidence. Preprint, submitted November 28, https://arxiv.org/abs/2312.03733.
  • Kraut RE, Resnick P (2012) Building Successful Online Communities: Evidence-Based Social Design (MIT Press, Cambridge, MA).
  • Kuang L, Huang N, Hong Y, Yan Z (2019) Spillover effects of financial incentives on non-incentivized user engagement: Evidence from an online knowledge exchange platform. J. Management Inform. Systems 36(1):289–320.
  • Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, Küttler H, et al. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 9459–9474.
  • Li H, Dong Q, Tang Z, Wang C, Zhang X, Huang H, Huang S, et al. (2024) Synthetic data (almost) from scratch: Generalized instruction tuning for language models. Preprint, submitted February 20, https://arxiv.org/abs/2402.13064.
  • Lin Z, Trivedi S, Sun J (2023) Generating with confidence: Uncertainty quantification for black-box large language models. Preprint, submitted May 30, https://arxiv.org/abs/2305.19187.
  • McKinzie B, Gan Z, Fauconnier JP, Dodge S, Zhang B, Dufter P, Shah D, et al. (2024) Mm1: Methods, analysis & insights from multimodal LLM pre-training. Preprint, submitted March 14, https://arxiv.org/abs/2403.09611.
  • Meincke L, Girotra K, Nave G, Terwiesch C, Ulrich KT (2024) Using large language models for idea generation in innovation. Preprint, submitted August 2, http://dx.doi.org/10.2139/ssrn.4526071.
  • Miller GA (1995) Wordnet: A lexical database for English. Comm. ACM 38(11):39–41.
  • Mims C (2024) The AI revolution is already losing steam. Wall Street J. (May 31), https://www.wsj.com/tech/ai/the-ai-revolution-is-already-losing-steam-a93478b1.
  • Nigam K, Ghani R (2000) Analyzing the effectiveness and applicability of co-training. Agah A, Callan J, Rundensteiner E, Gauch S, eds. Proc. Ninth Internat. Conf. Inform. Knowledge Management (Association for Computing Machinery, New York), 86–93.
  • Noy S, Zhang W (2023) Experimental evidence on the productivity effects of generative artificial intelligence. Science 381(6654):187–192.
  • Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, et al. (2022) Training language models to follow instructions with human feedback. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 27730–27744.
  • Padmakumar V, He H (2023) Does writing with language models reduce content diversity? Preprint, submitted September 11, https://arxiv.org/abs/2309.05196.
  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: A method for automatic evaluation of machine translation. Isabelle P, Charniak E, Lin D, eds. Proc. 40th Annual Meeting Assoc. Comput. Linguistics (Association for Computational Linguistics, Stroudsburg, PA), 311–318.
  • Pise NN, Kulkarni P (2008) A survey of semi-supervised learning methods. Zhao H, Deb K, eds. 2008 Internat. Conf. Comput. Intelligence Security, vol. 2 (IEEE, Piscataway, NJ), 30–34.
  • Rao R (2023) AI-generated data can poison future AI models. Sci. Amer. (July 28), https://www.scientificamerican.com/article/ai-generated-data-can-poison-future-ai-models/.
  • Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C (2023) Direct preference optimization: Your language model is secretly a reward model. Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, eds. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates Inc., Red Hook, NY), 53728–53741.
  • Rawte V, Sheth A, Das A (2023) A survey of hallucination in large foundation models. Preprint, submitted September 12, https://arxiv.org/abs/2309.05922.
  • Rei R, Stewart C, Farinha AC, Lavie A (2020) COMET: A neural framework for MT evaluation. Preprint, submitted September 18, https://arxiv.org/abs/2009.09025.
  • Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Preprint, submitted August 27, https://arxiv.org/abs/1908.10084.
  • Scudder H (1965) Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inform. Theory 11(3):363–371.
  • Seetharaman D (2024) For data-guzzling AI companies, the internet is too small. Wall Street J. (April 1), https://www.wsj.com/tech/ai/ai-training-data-synthetic-openai-anthropic-9230f8d8.
  • Sejnowski TJ (2023) Large language models and the reverse Turing test. Neural Comput. 35(3):309–342.
  • Sellam T, Das D, Parikh AP (2020) Bleurt: Learning robust metrics for text generation. Preprint, submitted April 9, https://arxiv.org/abs/2004.04696.
  • Shankar S, Zamfirescu-Pereira J, Hartmann B, Parameswaran AG, Arawjo I (2024) Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. Yao L, Goel M, Ion A, Lopes P, eds. UIST ’24 Proc. 37th Annual ACM Sympos. User Interface Software Technology (Association for Computing Machinery, New York), 131.
  • Shumailov I, Shumaylov Z, Zhao Y, Gal Y, Papernot N, Anderson R (2023) The curse of recursion: Training on generated data makes models forget. Preprint, submitted May 27, https://arxiv.org/abs/2305.17493.
  • Thompson B, Dhaliwal MP, Frisch P, Domhan T, Federico M (2024) A shocking amount of the web is machine translated: Insights from multi-way parallelism. Preprint, submitted January 11, https://arxiv.org/abs/2401.05749.
  • Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, et al. (2023) Llama 2: Open foundation and fine-tuned chat models. Preprint, submitted July 18, https://arxiv.org/abs/2307.09288.
  • Triguero I, García S, Herrera F (2015) Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowledge Inform. Systems 42(2):245–284.
  • Turing AM (1950) I—Computing machinery and intelligence. Mind LIX(236):433–460.
  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Curran Associates Inc., Red Hook, NY), 6000–6010.
  • Vert JP (2023) How will generative AI disrupt data science in drug discovery? Nature Biotechnology 41(6):750–751.
  • Veselovsky V, Ribeiro MH, West R (2023) Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. Preprint, submitted June 13, https://arxiv.org/abs/2306.07899.
  • Vieira I, Allred W, Lankford S, Castilho S, Way A (2024) How much data is enough data? Fine-tuning large language models for in-house translation: Performance evaluation across multiple dataset sizes. Preprint, submitted September 5, https://arxiv.org/abs/2409.03454.
  • Wang W, Sun T (2023) Human-AI co-creation in product ideation: The dual view of quality and diversity. Preprint, submitted December 20, http://dx.doi.org/10.2139/ssrn.4668241.
  • Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, et al. (2020) Transformers: State-of-the-art natural language processing. Liu Q, Schlangen D, eds. Proc. 2020 Conf. Empirical Methods Natural Language Processing System Demonstrations (Association for Computational Linguistics, Stroudsburg, PA), 38–45.
  • Xie Q, Luong MT, Hovy E, Le QV (2020a) Self-training with noisy student improves ImageNet classification. Boult T, Medioni G, Zabih R, eds. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE Computer Society, Washington, DC), 10687–10698.
  • Xie Q, Dai Z, Hovy E, Luong T, Le Q (2020b) Unsupervised data augmentation for consistency training. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin HT, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 6256–6268.
  • Yang X, Pan L, Zhao X, Chen H, Petzold L, Wang WY, Cheng W (2023) A survey on detection of LLMs-generated content. Preprint, submitted October 24, https://arxiv.org/abs/2310.15654.
  • Yu AW, Dohan D, Luong MT, Zhao R, Chen K, Norouzi M, Le QV (2018) QANet: Combining local convolution with global self-attention for reading comprehension. Preprint, submitted April 23, https://arxiv.org/abs/1804.09541.
  • Zhou E, Lee D (2024) Generative artificial intelligence, human creativity, and art. PNAS Nexus 3(3):pgae052.
  • Zhu Y, Lu S, Zheng L, Guo J, Zhang W, Wang J, Yu Y (2018) Texygen: A benchmarking platform for text generation models. Collins-Thompson K, Mei Q, Davison B, Liu Y, Yilmaz E, eds. 41st Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval (Association for Computing Machinery, New York), 1097–1100.