Bodnaruk A, Simonov A (2015) Do financial experts make better investment decisions? J. Financial Intermediation 24(4):514–536.
Bonner SE (1994) A model of the effects of audit task complexity. Accounting Organ. Soc. 19(3):213–234.
Brodley CE, Friedl MA (1999) Identifying mislabeled training data. J. Artificial Intelligence Res. 11(1):131–167.
Burris ER (2012) The risks and rewards of speaking up: Managerial responses to employee voice. Acad. Management J. 55(4):851–875.
Chapelle O, Schölkopf B, Zien A (2009) Semi-supervised learning. IEEE Trans. Neural Networks 20(3):542–542.
Chen Y, Kash I, Ruberry M, Shnayder V (2011) Decision markets with good incentives. Chen N, Elkind E, Koutsoupias E, eds. Internet and Network Economics. WINE 2011, Lecture Notes in Computer Science, vol. 7090 (Springer Berlin Heidelberg, Berlin, Heidelberg), 72–83.
Christoforaki M, Ipeirotis PG (2015) A system for scalable and reliable technical-skill testing in online labor markets. Comput. Networks 90:110–120.
Croskerry P (2003) The importance of cognitive errors in diagnosis and strategies to minimize them. Acad. Medicine 78(8):775–780.
Dai P, Lin CH, Mausam, Weld DS (2013) POMDP-based control of workflows for crowdsourcing. Artificial Intelligence 202:52–85.
Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. Proc. 22nd Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 285–294.
Danziger S, Levav J, Avnaim-Pesso L (2011) Extraneous factors in judicial decisions. Proc. Natl. Acad. Sci. USA 108(17):6889–6892.
Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J. Roy. Statist. Soc. Ser. C Appl. Statist. 28(1):20–28.
Dekel O, Shamir O (2009) Vox populi: Collecting high-quality labels from a crowd. Proc. 22nd Conf. Comput. Learn. Theory (COLT) (Montreal, Canada), 377–386.
Dismukes RK, Berman BA, Loukopoulos L (2017) The Limits of Expertise: Rethinking Pilot Error and the Causes of Airline Accidents (Routledge, London).
Dror IE, Charlton D (2006) Why experts make errors. J. Forensic Identification 56(4):600–616.
Dror IE, Charlton D, Péron AE (2006) Contextual information renders experts vulnerable to making erroneous identifications. Forensic Sci. Internat. 156(1):74–78.
Dror IE, Pascual-Leone A, Ramachandran V, Cole J, Della Sala S, Manly T, Mayes A (2011) The paradox of human expertise: Why experts get it wrong. Kapur N, ed. The Paradoxical Brain (Cambridge University Press, Cambridge, UK), 177–188.
Dror IE, Péron AE, Hind SL, Charlton D (2005) When emotions get the better of us: The effect of contextual top-down processing on matching fingerprints. Appl. Cognitive Psych. 19(6):799–809.
Fenton-O’Creevy M, Soane E, Nicholson N, Willman P (2011) Thinking, feeling and deciding: The influence of emotions on the decision making and performance of traders. J. Organ. Behav. 32(8):1044–1061.
Ferris GR, Munyon TP, Basik K, Buckley MR (2008) The performance evaluation context: Social, emotional, cognitive, political, and relationship components. Human Resources Management Rev. 18(3):146–163.
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Statist. 28(2):337–407.
Gao R, Saar-Tsechansky M (2020) Cost-accuracy aware adaptive labeling for active learning. Proc. AAAI Conf. Artificial Intelligence 34(3):2569–2576.
Geva T, Saar-Tsechansky M (2021) Who is a better decision maker? Data-driven expert ranking under unobserved quality. Production Oper. Management 30(1):127–144.
Geva T, Saar-Tsechansky M, Lustiger H (2019) More for less: Adaptive labeling payments in online labor markets. Data Mining Knowledge Discovery 33(6):1625–1673.
Graber M, Gordon R, Franklin N (2002) Reducing diagnostic errors in medicine: What’s the goal? Acad. Medicine 77(10):981–992.
Grote RC (2005) Forced Ranking: Making Performance Management Work (Harvard Business School Press, Boston).
Harris MM, Schaubroeck J (1988) A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psych. 41(1):43–62.
Hayward RA (2002) Counting deaths due to medical errors. JAMA 288(19):2404.
Hopkins M, Reeber E, Forman G, Suermondt J (1999) Spambase. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X.
Huang SJ, Chen JL, Mu X, Zhou ZH (2017) Cost-effective active learning from diverse labelers. Proc. 26th Internat. Joint Conf. Artificial Intelligence (IJCAI’17) (AAAI Press, Palo Alto, CA), 1879–1885.
Hutton RJ, Klein G (1999) Expert decision making. Systems Engrg. 2(1):32–45.
Ipeirotis PG, Provost F, Sheng VS, Wang J (2014) Repeated labeling using multiple noisy labelers. Data Mining Knowledge Discovery 28(2):402–441.
Kahneman D (1991) Article commentary: Judgment and decision making: A personal view. Psych. Sci. 2(3):142–145.
Kahneman D, Slovic SP, Slovic P, Tversky A (1982) Judgment Under Uncertainty: Heuristics and Biases (Cambridge University Press, Cambridge, UK).
Khetan A, Lipton ZC, Anandkumar A (2018) Learning from noisy singly-labeled data. 6th Internat. Conf. Learn. Representations, ICLR 2018 (Vancouver), Conf. Track Proc. (OpenReview.net).
Kiruthiga M, Sangeetha P (2016) Improving labeling quality using positive label frequency threshold algorithm. Internat. J. Comput. Sci. Engrg. Comm. 4(6):1467–1473.
Kittur A, Nickerson JV, Bernstein M, Gerber E, Shaw A, Zimmerman J, Lease M, Horton J (2013) The future of crowd work. Proc. 2013 Conf. Comput. Supported Cooperative Work (CSCW ’13) (Association for Computing Machinery, New York), 1301–1318.
Klein GA, Orasanu J, Calderwood R, Zsambok CE (1993) Decision Making in Action: Models and Methods, vol. 3 (Ablex, Norwood, NJ).
Kokkodis M (2021) Dynamic, multidimensional, and skillset-specific reputation systems for online work. Inform. Systems Res. 32(3):688–712.
Kokkodis M, Ipeirotis PG (2016) Reputation transferability in online labor markets. Management Sci. 62(6):1687–1706.
Kuhn GJ (2002) Diagnostic errors. Acad. Emergency Medicine 9(7):740–750.
Leape LL, Brennan TA, Laird N, Lawthers AG, Localio AR, Barnes BA, Hebert L, Newhouse JP, Weiler PC, Hiatt H (1991) The nature of adverse events in hospitalized patients: Results of the Harvard Medical Practice Study II. New England J. Medicine 324(6):377–384.
Linder JA, Doctor JN, Friedberg MW, Nieva HR, Birks C, Meeker D, Fox CR (2014) Time of day and the decision to prescribe antibiotics. JAMA Internal Medicine 174(12):2029–2031.
Milkovich GT, Newman JM, Gerhart B (2014) Compensation (McGraw-Hill, New York).
Nickerson RS (1998) Confirmation bias: A ubiquitous phenomenon in many guises. Rev. General Psych. 2(2):175–220.
Norman GR, Rosenthal D, Brooks LR, Allen SW, Muzzin LJ (1989) The development of expertise in dermatology. Arch. Dermatology 125(8):1063–1068.
Rodrigues F, Pereira F, Ribeiro B (2013) Learning from multiple annotators: Distinguishing good from random labelers. Pattern Recognition Lett. 34(12):1428–1436.
Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Machine Learn. 54(2):153–178.
Saber Tehrani AS, Lee H, Mathews SC, Shore A, Makary MA, Pronovost PJ, Newman-Toker DE (2013) 25-year summary of US malpractice claims for diagnostic errors 1986–2010: An analysis from the National Practitioner Data Bank. BMJ Quality Safety 22(8):672–680.
Schein AI, Popescul A, Ungar LH, Pennock DM (2002) Methods and metrics for cold-start recommendations. Proc. 25th Annual Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval (SIGIR ’02) (Association for Computing Machinery, New York), 253–260.
Settles B (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin, Madison.
Shanteau J (1992) Competence in experts: The role of task characteristics. Organ. Behav. Human Decision Processes 53(2):252–266.
Shanteau J, Weiss DJ, Thomas RP, Pounds JC (2002) Performance-based assessment of expertise: How to decide if someone is an expert or not. Eur. J. Oper. Res. 136(2):253–263.
Sheng V, Zhang J, Gu B, Wu X (2017) Majority voting and pairing with multiple noisy labeling. IEEE Trans. Knowledge Data Engrg. 31(7):1355–1368.
Singh H, Meyer AND, Thomas EJ (2014) The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality Safety 23(9):727–731.
Smith-Coggins R, Rosekind MR, Hurd S, Buccino KR (1994) Relationship of day vs. night sleep to physician performance and mood. Ann. Emergency Medicine 24(5):928–934.
Sussman EJ, Tsiaras WG, Soper KA (1982) Diagnosis of diabetic eye disease. JAMA 247(23):3231–3234.
Tanno R, Saeedi A, Sankaranarayanan S, Alexander DC, Silberman N (2019) Learning from noisy labels by regularized estimation of annotator confusion. 2019 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR) (IEEE, Piscataway, NJ), 11236–11245.
Tetlock PE (2017) Expert Political Judgment: How Good Is It? How Can We Know? (Princeton University Press, Princeton, NJ).
Tversky A, Kahneman D (1973) Availability: A heuristic for judging frequency and probability. Cognitive Psych. 5(2):207–232.
Tversky A, Kahneman D (1974) Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science 185(4157):1124–1131.
Van Such M, Lohr R, Beckman T, Naessens JM (2017) Extent of diagnostic agreement among medical referrals. J. Evaluation Clinical Practice 23(4):870–874.
Wang J, Ipeirotis PG, Provost F (2017) Cost-effective quality assurance in crowd labeling. Inform. Systems Res. 28(1):137–158.
Wang Y, Yao Q, Kwok JT, Ni LM (2020) Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surveys 53(3):1–34.
Wang G, Li J, Hopp WJ, Fazzalari FL, Bolling SF (2019) Using patient-specific quality information to unlock hidden healthcare capabilities. Manufacturing Service Oper. Management 21(3):582–601.
Weekley JA, Gier JA (1989) Ceilings in the reliability and validity of performance ratings: The case of expert raters. Acad. Management J. 32(1):213–222.
White N, Reid F, Harris A, Harries P, Stone P (2016) A systematic review of predictions of survival in palliative care: How accurate are clinicians and who are the experts? PLoS One 11(8):e0161407.
Whitehill J, Ruvolo P, Wu T, Bergsma J, Movellan J (2009) Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Adv. Neural Inform. Processing Systems 22 - Proc. 2009 Conf., vol. 22 (Curran Associates Inc., Red Hook, NY), 2035–2043.
Zhang B, Chen Z, Albert PS (2012) Estimating diagnostic accuracy of raters without a gold standard by exploiting a group of experts. Biometrics 68(4):1294–1302.
Zhang H, Xiao Y, Zhao X, Tian Z, Zhang SY, Dong D (2022) Physicians’ knowledge on specific rare diseases and its associated factors: A national cross-sectional study from China. Orphanet J. Rare Diseases 17(1):1–13.
Zhou ZH (2018) A brief introduction to weakly supervised learning. National Sci. Rev. 5(1):44–53.
Zhou D, Platt JC, Basu S, Mao Y (2012) Learning from the wisdom of crowds by minimax entropy. Proc. 25th Internat. Conf. Neural Inform. Processing Systems - Volume 2 (NIPS’12) (Curran Associates Inc., Red Hook, NY), 2195–2203.
Zwaan L, Singh H (2015) The challenges in defining and measuring diagnostic error. Diagnosis (Berlin) 2(2):97–103.

Volume 71, Issue 7

July 2025

Pages iv-vi, 5419-6318

Received: October 06, 2021
Accepted: March 23, 2024
Published Online: October 15, 2024

Cite as

Wanxue Dong, Maytal Saar-Tsechansky, Tomer Geva (2024) A Machine Learning Framework for Assessing Experts’ Decision Quality. Management Science 71(7):5696-5721.

https://doi.org/10.1287/mnsc.2021.03357

Acknowledgments

The authors are grateful for excellent comments and suggestions by the department editor, associate editor, and three reviewers. The paper has also greatly benefited from feedback provided by attendees of workshops and conferences, including WITS, SCECR, and INFORMS Data Science Workshop. M. Saar-Tsechansky and T. Geva contributed equally to this work.