Large Language Models for Market Research: A Data-Augmentation Approach

Mengxin Wang
[email protected]
https://orcid.org/0000-0002-3378-9402
Naveen Jindal School of Management, The University of Texas at Dallas, Richardson, Texas 75080

Dennis J. Zhang
[email protected]
https://orcid.org/0000-0002-4544-775X
Olin School of Business, Washington University in St. Louis, St. Louis, Missouri 63130

Heng Zhang (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-6105-6994
W. P. Carey School of Business, Arizona State University, Phoenix, Arizona 85069

Published Online: 17 Mar 2026
https://doi.org/10.1287/mksc.2025.0009

References

Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, et al. (2023) GPT-4 technical report. Preprint, submitted March 15, https://arxiv.org/abs/2303.08774.
Allenby GM, Rossi PE (2006) Hierarchical Bayes models. The Handbook of Marketing Research: Uses, Misuses, and Future Advances (SAGE Publications, Thousand Oaks, CA), 418–440.
Angelopoulos AN, Duchi JC, Zrnic T (2023a) PPI++: Efficient prediction-powered inference. Preprint, submitted November 2, https://arxiv.org/abs/2311.01453.
Angelopoulos AN, Bates S, Fannjiang C, Jordan MI, Zrnic T (2023b) Prediction-powered inference. Science 382(6671):669–674.
Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D (2023) Out of one, many: Using language models to simulate human samples. Political Anal. 31(3):337–351.
Bastani H, Zhang DJ, Zhang H (2022) Applied machine learning in operations management. Innovative Technology at the Interface of Finance and Operations: Volume I (Springer Nature), 189–222.
Beltagy I, Lo K, Cohan A (2019) SciBERT: A pretrained language model for scientific text. Preprint, submitted March 26, https://arxiv.org/abs/1903.10676.
Bound J, Brown C, Mathiowetz N (2001) Measurement error in survey data. Handbook of Econometrics, vol. 5 (Elsevier, Amsterdam), 3705–3843.
Brand J, Israeli A, Ngwe D (2023) Using LLMs for market research. Preprint, submitted March 30, https://doi.org/10.2139/ssrn.4395751.
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, et al. (2020) Language models are few-shot learners. Preprint, submitted May 28, https://arxiv.org/abs/2005.14165.
Chardon H, Lerasle M, Mourtada J (2024) Finite-sample performance of the maximum likelihood estimator in logistic regression. Preprint, submitted November 4, https://arxiv.org/abs/2411.02137.
Chen Y, Liu TX, Shan Y, Zhong S (2023) The emergence of economic rationality of GPT. Proc. Natl. Acad. Sci. USA 120(51):e2316205120.
Chen X, Owen Z, Pixton C, Simchi-Levi D (2022) A statistical learning approach to personalization in revenue management. Management Sci. 68(3):1923–1937.
Choi T-M, Kumar S, Yue X, Chan H-L (2022) Disruptive technologies and operations management in the industry 4.0 era and beyond. Production Oper. Management 31(1):9–31.
Chomsky N (1956) Three models for the description of language. IEEE Trans. Inform. Theory 2(3):113–124.
Connell P, Choi JH (2024) Estimating and correcting for misclassification error in empirical textual research. Preprint, submitted September 5, https://doi.org/10.2139/ssrn.4913179.
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint, submitted October 11, https://arxiv.org/abs/1810.04805.
Dzyabura D, Jagabathula S (2018) Offline assortment optimization in the presence of an online channel. Management Sci. 64(6):2767–2786.
Eggers F, Sattler H, Teichert T, Völckner F (2022) Choice-based conjoint analysis. Handbook of Market Research (Springer, Cham, Switzerland), 781–819.
Feller W (1971) An Introduction to Probability Theory and Its Applications, Volume II, 2nd ed. (John Wiley & Sons, New York).
Girotra K, Meincke L, Terwiesch C, Ulrich KT (2023) Ideas are dimes a dozen: Large language models for idea generation in innovation. Preprint, submitted August 2, https://doi.org/10.2139/ssrn.4526071.
Goli A, Singh A (2024) Frontiers: Can large language models capture human preferences? Marketing Sci. 43(4):709–722.
Green PE, Srinivasan V (1978) Conjoint analysis in consumer research: Issues and outlook. J. Consumer Res. 5(2):103–123.
Green PE, Srinivasan V (1990) Conjoint analysis in marketing: New developments with implications for research and practice. J. Marketing 54(4):3–19.
Gui G, Toubia O (2023) The challenge of using LLMs to simulate human behavior: A causal inference perspective. Preprint, submitted December 24, https://arxiv.org/abs/2312.15524.
Gururangan S, Marasović A, Swayamdipta S, Lo K, Beltagy I, Downey D, Smith NA (2020) Don’t stop pretraining: Adapt language models to domains and tasks. Preprint, submitted April 23, https://arxiv.org/abs/2004.10964.
Hair J Jr, Page M, Brunsveld N (2019) Essentials of Business Research Methods, 4th ed. (Routledge, New York).
Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. Preprint, submitted March 9, https://arxiv.org/abs/1503.02531.
Horton JJ (2023) Large language models as simulated economic agents: What can we learn from homo silicus? NBER Working Paper No. 31122, National Bureau of Economic Research, Cambridge, MA.
Huang Y, Yuan Z, Zhou Y, Guo K, Wang X, Zhuang H, Sun W, et al. (2024) Social science meets LLMs: How reliable are large language models in social simulations? Preprint, submitted October 30, https://arxiv.org/abs/2410.23426.
HuggingFace (2024) Meta-LLaMA. Accessed August 31, 2024, https://huggingface.co/meta-llama/Meta-Llama-3-8B#:~:text=Training%20Data,over%2010M%20human%2Dannotated%20examples.
Ji W, Lei L, Zrnic T (2025) Predictions as surrogates: Revisiting surrogate outcomes in the age of AI. Preprint, submitted January 16, https://arxiv.org/abs/2501.09731.
Kessels R, Goos P, Vandebroek M (2008) Optimal designs for conjoint experiments. Comput. Statist. Data Anal. 52(5):2369–2387.
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. Preprint, submitted December 22, https://arxiv.org/abs/1412.6980.
Kohli R, Sukumar R (1990) Heuristics for product-line design using conjoint analysis. Management Sci. 36(12):1464–1478.
Kreps S, Prasad S, Brownstein JS, Hswen Y, Garibaldi BT, Zhang B, Kriner DL (2020) Factors associated with US adults’ likelihood of accepting COVID-19 vaccination. JAMA Network Open 3(10):e2025594.
Ludwig J, Mullainathan S, Rambachan A (2024) Large language models: An applied econometric framework. Preprint, submitted December 9, https://arxiv.org/abs/2412.07031.
Naveed H, Ullah Khan A, Qiu S, Saqib M, Anwar S, Usman M, Akhtar N, et al. (2023) A comprehensive overview of large language models. Preprint, submitted July 12, https://arxiv.org/abs/2307.06435.
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. Handbook of Econometrics, vol. 4 (North Holland, Amsterdam), 2111–2245.
Olsen TL, Tomlin B (2020) Industry 4.0: Opportunities and challenges for operations management. Manufacturing Service Oper. Management 22(1):113–122.
Pan SJ, Yang Q (2009) A survey on transfer learning. IEEE Trans. Knowledge Data Engrg. 22(10):1345–1359.
Parthasarathy VB, Zafar A, Khan A, Shahid A (2024) The ultimate guide to fine-tuning LLMs from basics to breakthroughs: An exhaustive review of technologies, research, best practices, applied research challenges and opportunities. Preprint, submitted August 23, https://arxiv.org/abs/2408.13296.
Peng A, Allard J, Heidel S (2024) Fine-tuning now available for GPT-4o. Accessed December 15, 2024, https://openai.com/index/gpt-4o-fine-tuning/.
Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. OpenAI. Accessed June 3, 2025, https://openai.com/index/language-unsupervised/.
Raschka S (2018) Model evaluation, model selection, and algorithm selection in machine learning. Preprint, submitted November 13, https://arxiv.org/abs/1811.12808.
Shane SA, Ulrich KT (2004) 50th anniversary article: Technological innovation, product development, and entrepreneurship in management science. Management Sci. 50(2):133–144.
Solomon MR (2020) Consumer Behavior: Buying, Having, and Being (Pearson, Harlow, England).
Spencer V (2019) Choice modeling sports cars. Accessed October 9, 2024, https://github.com/spensorflow/Marketing-Analytics---Choice-Modeling-Sports-Car-Sales.
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Preprint, submitted September 10, https://arxiv.org/abs/1409.3215.
Teixeira L (2023) Prompt engineering: Compressing text to ideas and decompressing back with sparse priming representations. Accessed December 29, 2024, https://medium.com/@lawrenceteixeira/prompt-engineering-compressing-text-to-ideas-and-decompressing-back.
Terwiesch C (2019) OM Forum—Empirical research in operations management: From field studies to analyzing digital exhaust. Manufacturing Service Oper. Management 21(4):713–722.
Van der Vaart AW (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press).
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, et al. (2017) Attention is all you need. Guyon I, von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Curran Associates, Inc., Red Hook, NY), 5998–6008.
Wainwright MJ (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).
Wang X, Camm JD, Curry DJ (2009) A branch-and-price approach to the share-of-choice product line design problem. Management Sci. 55(10):1718–1728.
Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 24824–24837.
Yang K, Li H, Wen H, Peng T-Q, Tang J, Liu H (2024) Are large language models (LLMs) good social predictors? Preprint, submitted February 20, https://arxiv.org/abs/2402.12620.
Yao S, Yu D, Zhao J, Shafran I, Griffiths T, Cao Y, Narasimhan K (2023) Tree of thoughts: Deliberate problem solving with large language models. Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, eds. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates, Inc., Red Hook, NY), 11809–11822.
Yoo Y, Henfridsson O, Kallinikos J, Gregory R, Burtch G, Chatterjee S, Sarker S (2024) The next frontiers of digital innovation research. Inform. Systems Res. 35(4):1507–1523.
Zhang J, Xue W, Yu Y, Tan Y (2023) Debiasing ML- or AI-generated regressors in partial linear models. Preprint, submitted November 30, https://doi.org/10.2139/ssrn.4636026.
Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, et al. (2020) A comprehensive survey on transfer learning. Proc. IEEE 109(1):43–76.
Ziems C, Held W, Shaikh O, Chen J, Zhang Z, Yang D (2024) Can large language models transform computational social science? Comput. Linguist. 50(1):237–291.

Articles In Advance

Information

Received: January 06, 2025
Accepted: January 08, 2026
Published Online: March 17, 2026

Cite as

Mengxin Wang, Dennis J. Zhang, Heng Zhang (2026) Large Language Models for Market Research: A Data-Augmentation Approach. Marketing Science 0(0).

https://doi.org/10.1287/mksc.2025.0009
