
Articles In Advance

Information

Received: July 15, 2023
Accepted: June 14, 2025
Published Online: September 23, 2025

Cite as

Inbal Yahav, Anat Goldstein, Tomer Geva, Shahar Meir, Onn Shehory (2025) Quality Control for Crowd Workers and for Language Models: A Framework for Free-Text Response Evaluation with No Ground Truth. Information Systems Research 0(0).

https://doi.org/10.1287/isre.2023.0426

Acknowledgments

The authors are grateful for the excellent comments and suggestions from the senior editor, associate editor, and three reviewers. The first three authors contributed equally and are listed in reverse alphabetical order.

Quality Control for Crowd Workers and for Language Models: A Framework for Free-Text Response Evaluation with No Ground Truth
