Quality Control for Crowd Workers and for Language Models: A Framework for Free-Text Response Evaluation with No Ground Truth

Published Online: https://doi.org/10.1287/isre.2023.0426

References

  • Abbasi A, Parsons J, Pant G, Sheng ORL, Sarker S (2024) Pathways for design research on artificial intelligence. Inform. Systems Res. 35(2):441–459.
  • Abedissa T, Usbeck R, Assabie Y (2023) AmQA: Amharic question answering dataset. Preprint, submitted March 6, https://arxiv.org/abs/2303.03290.
  • Abualsaud M, Chen IX, Ghajar K, Minh LNL, Smucker MD, Tahami AV, Zhang D (2021) UWaterlooMDS at the TREC 2021 health misinformation track. 30th Text REtrieval Conf. Proc. (National Institute of Standards and Technology, Gaithersburg, MD).
  • Alfonseca E, Pérez D (2004) Automatic assessment of open ended questions with a BLEU-inspired algorithm and shallow NLP. Vicedo JL, Martínez-Barco P, Muñoz R, Saiz Noeda M, eds. Advances in Natural Language Processing. EsTAL 2004, Lecture Notes in Computer Science, vol. 3230 (Springer, Berlin, Heidelberg).
  • Awasthi P, Blum A, Haghtalab N, Mansour Y (2017) Efficient PAC learning from the crowd. Proc. 2017 Conf. Learn. Theory, vol. 65 (PMLR, New York), 127–150.
  • Bajaj P, Campos D, Craswell N, Deng L, Gao J, Liu X, Majumder R, et al. (2016) MS MARCO: A human generated machine reading comprehension dataset. Preprint, submitted November 28, https://arxiv.org/abs/1611.09268.
  • Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on freebase from question-answer pairs. Proc. 2013 Conf. Empirical Methods Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA), 1533–1544.
  • Bishop CM (2006) Pattern Recognition and Machine Learning (Springer, New York).
  • Bondarenko A, Fröbe M, Kasturia V, Hagen M, Völske M, Stein B (2019) Webis at TREC 2019: Decision Track. TREC 2019 Proc. (National Institute of Standards and Technology, Gaithersburg, MD).
  • Bonthu S, Rama Sree S, Krishna Prasad MHM (2021) Automated short answer grading using deep learning: A survey. Internat. Cross-Domain Conf. Machine Learn. Knowledge Extraction (Springer, Berlin, Heidelberg), 61–78.
  • Brand C, Ganian R, Simonov K (2023) A parameterized theory of PAC learning. Proc. AAAI Conf. Artificial Intelligence, vol. 37 (AAAI Press, Palo Alto, CA), 6834–6841.
  • Branson S, van Horn G, Perona P (2017) Lean crowdsourcing: Combining humans and machines in an online system. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 7474–7483.
  • Braylan A, Lease M (2020) Modeling and aggregation of complex annotations via annotation distances. Proc. Web Conf. (Association for Computing Machinery, New York), 1807–1818.
  • Braylan A, Alonso O, Lease M (2022) Measuring annotator agreement generally across complex structured, multi-object, and free-text annotation tasks. Proc. ACM Web Conf. (Association for Computing Machinery, New York), 1720–1730.
  • Burrows S, Gurevych I, Stein B (2015) The eras and trends of automatic short answer grading. Internat. J. Artificial Intelligence Ed. 25(1):60–117.
  • Cer D, Yang Y, Kong SY, Hua N, Limtiaco N, St. John R, Constant N, et al. (2018) Universal sentence encoder. Preprint, submitted March 29, https://arxiv.org/abs/1803.11175.
  • Chai L, Sun H, Wang Z (2022) An error consistency based approach to answer aggregation in open-ended crowdsourcing. Inform. Sci. 608:1029–1044.
  • Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, Chen H, et al. (2024) A survey on evaluation of large language models. ACM Trans. Intelligent Systems Tech. 15(3):1–45.
  • Chen X, Aksitov R, Alon U, Ren J, Xiao K, Yin P, Prakash S, Sutton C, Wang X, Zhou D (2023) Universal self-consistency for large language model generation. Preprint, submitted November 29, https://arxiv.org/abs/2311.17311.
  • Chiang CH, Lee HY (2023) Can large language models be an alternative to human evaluations? Preprint, submitted May 3, https://arxiv.org/abs/2305.01937.
  • Clarke CLA, Maistro M, Smucker MD (2022) Overview of the TREC 2021 health misinformation track. Proc. Thirtieth Text Retrieval Conf. (TREC 2021), Special Publication 500-335 (National Institute of Standards and Technology (NIST), Washington, DC).
  • Clarke CL, Rizvi S, Smucker MD, Maistro M, Zuccon G (2020) Overview of the TREC 2020 health misinformation track. TREC 2020 Proc.
  • Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. Proc. 22nd Internat. Conf. World Wide Web, 285–294.
  • Dam SK, Hong CS, Qiao Y, Zhang C (2024) A complete survey on LLM-based AI chatbots. Preprint, submitted June 17, https://arxiv.org/abs/2406.16937.
  • Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J. Roy. Statist. Soc. Ser. C Appl. Statist. 28(1):20–28.
  • Dekel O, Shamir O (2009) Vox populi: Collecting high-quality labels from a crowd. 22nd Annual Conf. Learn. Theory (COLT) Proc.
  • Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint, submitted October 11, https://arxiv.org/abs/1810.04805.
  • d’Hoffschmidt M, Belblidia W, Brendlé T, Heinrich Q, Vidal M (2020) FQuAD: French question answering dataset. Preprint, submitted February 14, https://arxiv.org/abs/2002.06071.
  • Dong W, Saar-Tsechansky M, Geva T (2024) A machine learning framework for assessing experts’ decision quality. Management Sci. 71(7):5696–5721.
  • Dzikovska MO, Nielsen R, Brew C (2012) Towards effective tutorial feedback for explanation questions: A dataset and baselines. Proc. 2012 Conf. North Amer. Chapter Assoc. Comput. Linguistics Human Language Tech. (ACL, Stroudsburg, PA), 200–210.
  • Fernández-Pichel M, Losada DE, Pichel JC (2022) A multistage retrieval system for health-related misinformation detection. Engrg. Appl. Artificial Intelligence 115:105211.
  • Fernández-Pichel M, Losada DE, Pichel JC, Elsweiler D (2020) CiTIUS at the TREC 2020 health misinformation track. TREC 2020 Proc.
  • Galhardi LB, Brancher JD (2018) Machine learning approach for automatic short answer grading: A systematic review. Ibero-Amer. Conf. Artificial Intelligence (Springer, Cham, Switzerland), 380–391.
  • Gao T, Yao X, Chen D (2021) SimCSE: Simple contrastive learning of sentence embeddings. Preprint, submitted April 18, https://arxiv.org/abs/2104.08821.
  • Geva T, Saar-Tsechansky M (2016) Who’s a good decision maker? Data-driven expert worker ranking under unobservable quality. Proc. 37th Internat. Conf. Inform. Systems (Association for Information Systems, Atlanta).
  • Geva T, Saar-Tsechansky M (2021) Who is a better decision maker? Data-driven expert ranking under unobserved quality. Production Oper. Management 30(1):127–144.
  • Geva T, Saar-Tsechansky M, Lustiger H (2019) More for less: Adaptive labeling payments in online labor markets. Data Mining Knowledge Discovery 33(6):1625–1673.
  • Gomaa WH, Fahmy AA (2012) Short answer grading using string similarity and corpus-based similarity. Internat. J. Advanced Comput. Sci. Appl. 3(11).
  • Guo Z, Schlichtkrull M, Vlachos A (2022) A survey on automated fact-checking. Trans. Assoc. Comput. Linguistics 10:178–206.
  • Gütl C (2008) Moving towards a fully automatic knowledge assessment tool. Internat. J. Emerging Tech. Learn. 3(1).
  • Hadi MU, Qureshi R, Shah A, Irfan M, Zafar A, Shaikh MB, Akhtar N, Wu J, Mirjalili S, Shah M (2025) Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Preprint, submitted February 10, http://dx.doi.org/10.36227/techrxiv.23589741.v8.
  • Haller S, Aldea A, Seifert C, Strisciuglio N (2022) Survey on automated short answer grading with deep learning: From word embeddings to transformers. Preprint, submitted March 11, https://arxiv.org/abs/2204.03503.
  • Hanneke S, Green Larsen K, Zhivotovskiy N (2024) Revisiting agnostic PAC learning. Proc. 65th IEEE Annual Sympos. Foundations Comput. Sci., 1968–1982.
  • Hastie T, Tibshirani R, Friedman JH (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York).
  • Heilman M, Madnani N (2015) The impact of training data on automated short answer scoring performance. Proc. 10th Workshop Innovative Use NLP Building Ed. Appl. (ACL, Stroudsburg, PA), 81–85.
  • Heinecke S, Reyzin L (2019) Crowdsourced PAC learning under classification noise. Proc. Seventh AAAI Conf. Human Comput. Crowdsourcing (AAAI, Palo Alto, CA), 41–49.
  • Hevner AR, March ST, Park J, Ram S (2004) Design science in information systems research. MIS Quart. 28(1):75–105.
  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58(301):13–30.
  • Horbach A, Pinkal M (2018) Semi-supervised clustering for short answer scoring. Proc. 11th Internat. Conf. Language Resources Evaluation (ACL, Stroudsburg, PA).
  • Ipeirotis PG, Provost F, Sheng VS, Wang J (2014) Repeated labeling using multiple noisy labelers. Data Mining Knowledge Discovery 28(2):402–441.
  • Jordan S (2012) Short-answer e-assessment questions: Five years on. Whitelock D, Wills G, Warburton B, eds. Proc. 15th Internat. Comput. Assisted Assessment Conf. (Southampton).
  • Joshi M, Choi E, Weld DS, Zettlemoyer L (2017) TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. Preprint, submitted May 9, https://arxiv.org/abs/1705.03551.
  • Karchmer A (2024) Agnostic membership query learning with nontrivial savings: New results and techniques. Proc. 35th Internat. Conf. Algorithmic Learn. Theory. Proc. Machine Learn. Res., vol. 237 (PMLR, New York), 654–682.
  • Karger DR, Oh S, Shah D (2014) Budget-optimal task allocation for reliable crowdsourcing systems. Oper. Res. 62(1):1–24.
  • Kearns MJ, Vazirani U (1994) An Introduction to Computational Learning Theory (MIT Press, Cambridge, MA).
  • Khetan A, Lipton ZC, Anandkumar A (2017) Learning from noisy singly-labeled data. Preprint, submitted December 13, https://arxiv.org/abs/1712.04577.
  • Klein R, Kyrilov A, Tokman M (2011) Automated assessment of short free-text responses in computer science using latent semantic analysis. Proc. 16th Annual Joint Conf. Innovation Tech. Comput. Sci. Ed. (ACM, New York), 158–162.
  • Kittur A, Nickerson JV, Bernstein M, Gerber E, Shaw A, Zimmerman J, Lease M, Horton J (2013) The future of crowd work. Proc. 2013 Conf. Comput. Supported Cooperative Work (Association for Computing Machinery, New York), 1301–1318.
  • Kočiský T, Schwarz J, Blunsom P, Dyer C, Hermann KM, Melis G, Grefenstette E (2018) The NarrativeQA reading comprehension challenge. Trans. Assoc. Comput. Linguistics 6:317–328.
  • Kumar A, Lease M (2011) Modeling annotator accuracies for supervised learning. Proc. Workshop Crowdsourcing Search Data Mining Fourth ACM Internat. Conf. Web Search Data Mining (ACM, New York), 19–22.
  • Kwiatkowski T, Palomaki J, Redfield O, Collins M, Parikh A, Alberti C, Epstein D, et al. (2019) Natural questions: A benchmark for question answering research. Trans. Assoc. Comput. Linguistics 7:452–466.
  • Larsen KG (2023) Bagging is an optimal PAC learner. Proc. 36th Annual Conf. Learn. Theory Proc. Machine Learn. Res., vol. 195 (PMLR, New York), 1–20.
  • Leacock C, Chodorow M (2003) C-rater: Automated scoring of short-answer questions. Comput. Humanities 37(4):389–405.
  • Lee S, Kang M, Lee J, Hwang SJ (2021) Learning to perturb word embeddings for out-of-distribution QA. Preprint, submitted May 6, https://arxiv.org/abs/2105.02692.
  • Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L (2019) BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Preprint, submitted October 29, https://arxiv.org/abs/1910.13461.
  • Li J (2020) Crowdsourced text sequence aggregation based on hybrid reliability and representation. Proc. 43rd Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval (Association for Computing Machinery, New York), 1761–1764.
  • Li J, Fukumoto F (2019) A dataset of crowdsourced word sequences: Collections and answer aggregation for ground truth creation. Proc. First Workshop Aggregating Analysing Crowdsourced Annotations NLP (Association for Computational Linguistics, Stroudsburg, PA), 24–28.
  • Liang P, Bommasani R, Lee T, Tsipras D, Soylu D, Yasunaga M, Zhang Y, et al. (2023) Holistic evaluation of language models. Preprint, submitted November 16, 2022, https://arxiv.org/abs/2211.09110.
  • Lima LC, Wright DB, Augenstein I, Maistro M (2021) University of Copenhagen participation in TREC Health Misinformation track 2020. Preprint, submitted March 3, https://arxiv.org/abs/2103.02462.
  • Lin S, Hilton J, Evans O (2022) TruthfulQA: Measuring how models mimic human falsehoods. Preprint, submitted September 8, 2021, https://arxiv.org/abs/2109.07958.
  • Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A robustly optimized BERT pretraining approach. Preprint, submitted July 26, https://arxiv.org/abs/1907.11692.
  • Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. Proc. 12th Conf. Eur. Chapter ACL (ACL, Stroudsburg, PA), 567–575.
  • Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. Proc. 49th Annual Meeting Assoc. Comput. Linguistics Human Language Tech. (ACL, Stroudsburg, PA), 752–762.
  • Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of Machine Learning (MIT Press, Cambridge, MA).
  • Möller T, Risch J, Pietsch M (2021) GermanQuAD and GermanDPR: Improving non-English question answering and passage retrieval. Preprint, submitted April 26, https://arxiv.org/abs/2104.12741.
  • Nguyen T, Rosenberg M, Song X, Gao J, Tiwary S, Majumder R, Deng L (2016) MS MARCO: A human generated machine reading comprehension dataset. Workshop Adv. Neural Inform. Processing Systems (CEUR-WS.org).
  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: A method for automatic evaluation of machine translation. Proc. 40th Annual Meeting Assoc. Comput. Linguistics (Association for Computational Linguistics, Stroudsburg, PA), 311–318.
  • Polat M (2020) Analysis of multiple-choice versus open-ended questions in language tests according to different cognitive domain levels. Novitas-ROYAL 14(2):76–96.
  • Pradeep R, Ma X, Nogueira R, Lin J (2021) Vera: Prediction techniques for reducing harmful misinformation in consumer health search. Proc. 44th Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval (ACM, New York), 2066–2070.
  • Rajpurkar P, Jia R, Liang P (2018) Know what you don’t know: Unanswerable questions for SQuAD. Preprint, submitted June 11, https://arxiv.org/abs/1806.03822.
  • Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. Proc. 2016 Conf. Empirical Methods Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA), 2383–2392.
  • Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. Proc. 10th Workshop Innovative Use NLP Building Ed. Appl. (ACL, Stroudsburg, PA), 97–106.
  • Ramesh D, Sanampudi SK (2022) An automated essay scoring systems: A systematic literature review. Artificial Intelligence Rev. 55(3):2495–2527.
  • Raykar VC, Yu S, Zhao LH, Valadez GH, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J. Machine Learn. Res. 11(43):1297–1322.
  • Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Preprint, submitted August 27, https://arxiv.org/abs/1908.10084.
  • Rodrigues F, Pereira F, Ribeiro B (2013) Learning from multiple annotators: Distinguishing good from random labelers. Pattern Recognition Lett. 34(12):1428–1436.
  • Roy S, Narahari Y, Deshmukh OD (2015) A perspective on computer assisted assessment techniques for short free-text answers. Internat. Comput. Assisted Assessment Conf. (Springer, Cham, Switzerland), 96–109.
  • Roy S, Dandapat S, Nagesh A, Narahari Y (2016) Wisdom of students: A consistent automatic short answer grading technique. Proc. 13th Internat. Conf. Natural Language Processing (ACL, Stroudsburg, PA), 178–187.
  • Saha S, Dhamecha TI, Marvaniya S, Sindhgatta R, Sengupta B (2018) Sentence level or token level features for automatic short answer grading? Use both. Internat. Conf. Artificial Intelligence Ed. (Springer, Cham, Switzerland), 503–517.
  • Singh P, Sheorain S, Tomar S, Sharma S, Bansode NK (2018) Descriptive answer evaluation. Internat. Res. J. Engrg. Tech. 5(5):2709–2712.
  • Steimel K, Riordan B (2020) Towards instance-based content scoring with pre-trained transformer models. Proc. 34th AAAI Conf. Artificial Intelligence (AAAI, Palo Alto, CA).
  • Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. Proc. 2016 Conf. North Amer. Chapter Assoc. Comput. Linguistics Human Language Tech. (ACL, Stroudsburg, PA), 1070–1075.
  • Sung C, Dhamecha TI, Mukhi N (2019) Improving short answer grading using transformer-based pre-training. Internat. Conf. Artificial Intelligence Ed. (Springer, Cham, Switzerland), 469–481.
  • Tanno R, Saeedi A, Sankaranarayanan S, Alexander DC, Silberman N (2019) Learning from noisy labels by regularized estimation of annotator confusion. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 11244–11253.
  • Valiant LG (1984) A theory of the learnable. Comm. ACM 27(11):1134–1142.
  • Wang B, Asan O, Mansouri M (2023a) Perspectives of patients with chronic diseases on future acceptance of AI–based home care systems: Cross-sectional web-based survey study. JMIR Human Factors 10(1):e49788.
  • Wang J, Ipeirotis PG, Provost F (2017) Cost-effective quality assurance in crowd labeling. Inform. Systems Res. 28(1):137–158.
  • Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, Chowdhery A, Zhou D (2022) Self-consistency improves chain of thought reasoning in language models. Preprint, submitted March 21, https://arxiv.org/abs/2203.11171.
  • Wang P, Li L, Chen L, Cai Z, Zhu D, Lin B, Cao Y, Liu Q, Liu T, Sui Z (2023b) Large language models are not fair evaluators. Preprint, submitted May 29, https://arxiv.org/abs/2305.17926.
  • Warfield SK, Zou KH, Wells WM (2004) Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Trans. Medical Imaging 23(7):903–921.
  • Wauthier FL, Jordan M (2011) Bayesian bias mitigation for crowdsourcing. Adv. Neural Inform. Processing Systems, vol. 24 (Curran Associates Inc., Red Hook, NY), 1800–1808.
  • Whitehill J, Wu TF, Bergsma J, Movellan J, Ruvolo P (2009) Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Adv. Neural Inform. Processing Systems, vol. 22 (Curran Associates Inc., Red Hook, NY).
  • Widyassari AP, Rustad S, Shidik GF, Noersasongko E, Syukur A, Affandy A (2022) Review of automatic text summarization techniques & methods. J. King Saud Univ. Comput. Inform. Sci. 34(4):1029–1046.
  • Xia L, Guan M, Liu J, Cao X, Luo D (2021) Attention-based bidirectional long short-term memory neural network for short answer scoring. Guan M, Na Z, eds. Internat. Conf. Machine Learn. Intelligent Comm. (Springer, Cham, Switzerland), 104–112.
  • Yin J, Luo J, Brown SA (2021) Learning from crowdsourced multi-labeling: A variational Bayesian approach. Inform. Systems Res. 32(3):752–773.
  • Zeng S, Shen J (2022) Efficient PAC learning from the crowd with pairwise comparisons. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 25973–25993.
  • Zeng S, Shen J (2023) Semi-verified PAC learning from the crowd. Proc. 26th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 2068–2086.
  • Zesch T, Heilman M, Cahill A (2015) Reducing annotation efforts in supervised short answer scoring. Proc. 10th Workshop Innovative Use NLP Building Ed. Appl. (ACL, Stroudsburg, PA), 124–132.
  • Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y (2019) BERTScore: Evaluating text generation with BERT. Preprint, submitted April 21, https://arxiv.org/abs/1904.09675.
  • Zhang L, Zhang J, Ke X, Li H, Huang X, Shao Z, Cao S, Lv X (2023) A survey on complex factual question answering. AI Open 4:1–12.
  • Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, Lin Z, et al. (2023) Judging LLM-as-a-judge with MT-bench and chatbot arena. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates Inc., Red Hook, NY), 46595–46623.
  • Zhou T, Li S (2025) Understanding user switch of information seeking: From search engines to generative AI. J. Librarianship Inform. Sci. Forthcoming.
  • Zhu P, Wang Z, Hauff C, Yang J, Anand A (2022) Answer quality aware aggregation for extractive QA crowdsourcing. Findings Assoc. Comput. Linguistics (Association for Computational Linguistics, Stroudsburg, PA), 6147–6159.
  • Ziegler DM, Stiennon N, Wu J, Brown TB, Radford A, Amodei D, Christiano P, Irving G (2019) Fine-tuning language models from human preferences. Preprint, submitted September 18, https://arxiv.org/abs/1909.08593.