A Machine Learning Framework for Assessing Experts’ Decision Quality

Published Online: https://doi.org/10.1287/mnsc.2021.03357

References

  • Bodnaruk A, Simonov A (2015) Do financial experts make better investment decisions? J. Financial Intermediation 24(4):514–536.
  • Bonner SE (1994) A model of the effects of audit task complexity. Accounting Organ. Soc. 19(3):213–234.
  • Brodley CE, Friedl MA (1999) Identifying mislabeled training data. J. Artificial Intelligence Res. 11(1):131–167.
  • Burris ER (2012) The risks and rewards of speaking up: Managerial responses to employee voice. Acad. Management J. 55(4):851–875.
  • Chapelle O, Scholkopf B, Zien A (2009) Semi-supervised learning. IEEE Trans. Neural Networks 20(3):542–542.
  • Chen Y, Kash I, Ruberry M, Shnayder V (2011) Decision markets with good incentives. Chen N, Elkind E, Koutsoupias E, eds. Internet and Network Economics. WINE 2011, Lecture Notes in Computer Science, vol. 7090 (Springer Berlin Heidelberg, Berlin, Heidelberg), 72–83.
  • Christoforaki M, Ipeirotis PG (2015) A system for scalable and reliable technical-skill testing in online labor markets. Comput. Networks 90:110–120.
  • Croskerry P (2003) The importance of cognitive errors in diagnosis and strategies to minimize them. Acad. Medicine 78(8):775–780.
  • Dai P, Lin CH, Mausam, Weld DS (2013) POMDP-based control of workflows for crowdsourcing. Artificial Intelligence 202:52–85.
  • Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. Proc. 22nd Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 285–294.
  • Danziger S, Levav J, Avnaim-Pesso L (2011) Extraneous factors in judicial decisions. Proc. Natl. Acad. Sci. USA 108(17):6889–6892.
  • Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J. Roy. Statist. Soc. Ser. C Appl. Statist. 28(1):20–28.
  • Dekel O, Shamir O (2009) Vox populi: Collecting high-quality labels from a crowd. Proc. 22nd Conf. Comput. Learn. Theory (COLT) (Montreal, Canada), 377–386.
  • Dismukes RK, Berman BA, Loukopoulos L (2017) The Limits of Expertise: Rethinking Pilot Error and the Causes of Airline Accidents (Routledge, London).
  • Dror IE, Charlton D (2006) Why experts make errors. J. Forensic Identification 56(4):600–616.
  • Dror IE, Charlton D, Péron AE (2006) Contextual information renders experts vulnerable to making erroneous identifications. Forensic Sci. Internat. 156(1):74–78.
  • Dror IE, Pascual-Leone A, Ramachandran V, Cole J, Della Sala S, Manly T, Mayes A (2011) The paradox of human expertise: Why experts get it wrong. Kapur N, ed. The Paradoxical Brain (Cambridge University Press, Cambridge, UK), 177–188.
  • Dror IE, Péron AE, Hind SL, Charlton D (2005) When emotions get the better of us: The effect of contextual top-down processing on matching fingerprints. Appl. Cognitive Psych. 19(6):799–809.
  • Fenton-O’Creevy M, Soane E, Nicholson N, Willman P (2011) Thinking, feeling and deciding: The influence of emotions on the decision making and performance of traders. J. Organ. Behav. 32(8):1044–1061.
  • Ferris GR, Munyon TP, Basik K, Buckley MR (2008) The performance evaluation context: Social, emotional, cognitive, political, and relationship components. Human Resources Management Rev. 18(3):146–163.
  • Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Statist. 28(2):337–407.
  • Gao R, Saar-Tsechansky M (2020) Cost-accuracy aware adaptive labeling for active learning. Proc. AAAI Conf. Artificial Intelligence 34(3):2569–2576.
  • Geva T, Saar-Tsechansky M (2021) Who is a better decision maker? Data-driven expert ranking under unobserved quality. Production Oper. Management 30(1):127–144.
  • Geva T, Saar-Tsechansky M, Lustiger H (2019) More for less: Adaptive labeling payments in online labor markets. Data Mining Knowledge Discovery 33(6):1625–1673.
  • Graber M, Gordon R, Franklin N (2002) Reducing diagnostic errors in medicine: What’s the goal? Acad. Medicine 77(10):981–992.
  • Grote RC (2005) Forced Ranking: Making Performance Management Work (Harvard Business School Press, Boston).
  • Harris MM, Schaubroeck J (1988) A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psych. 41(1):43–62.
  • Hayward RA (2002) Counting deaths due to medical errors. JAMA 288(19):2404.
  • Hopkins M, Reeber E, Forman G, Suermondt J (1999) Spambase. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X.
  • Huang SJ, Chen JL, Mu X, Zhou ZH (2017) Cost-effective active learning from diverse labelers. Proc. 26th Internat. Joint Conf. Artificial Intelligence (IJCAI’17) (AAAI Press, Palo Alto, CA), 1879–1885.
  • Hutton RJ, Klein G (1999) Expert decision making. Systems Engrg. 2(1):32–45.
  • Ipeirotis PG, Provost F, Sheng VS, Wang J (2014) Repeated labeling using multiple noisy labelers. Data Mining Knowledge Discovery 28(2):402–441.
  • Kahneman D (1991) Article commentary: Judgment and decision making: A personal view. Psych. Sci. 2(3):142–145.
  • Kahneman D, Slovic P, Tversky A (1982) Judgment Under Uncertainty: Heuristics and Biases (Cambridge University Press, Cambridge, UK).
  • Khetan A, Lipton ZC, Anandkumar A (2018) Learning from noisy singly-labeled data. 6th Internat. Conf. Learn. Representations, ICLR 2018 (Vancouver), Conf. Track Proc. (OpenReview.net).
  • Kiruthiga M, Sangeetha P (2016) Improving labeling quality using positive label frequency threshold algorithm. Internat. J. Comput. Sci. Engrg. Comm. 4(6):1467–1473.
  • Kittur A, Nickerson JV, Bernstein M, Gerber E, Shaw A, Zimmerman J, Lease M, Horton J (2013) The future of crowd work. Proc. 2013 Conf. Comput. Supported Cooperative Work (CSCW ’13) (Association for Computing Machinery, New York), 1301–1318.
  • Klein GA, Orasanu J, Calderwood R, Zsambok CE (1993) Decision Making in Action: Models and Methods, vol. 3 (Ablex, Norwood, NJ).
  • Kokkodis M (2021) Dynamic, multidimensional, and skillset-specific reputation systems for online work. Inform. Systems Res. 32(3):688–712.
  • Kokkodis M, Ipeirotis PG (2016) Reputation transferability in online labor markets. Management Sci. 62(6):1687–1706.
  • Kuhn GJ (2002) Diagnostic errors. Acad. Emergency Medicine 9(7):740–750.
  • Leape LL, Brennan TA, Laird N, Lawthers AG, Localio AR, Barnes BA, Hebert L, Newhouse JP, Weiler PC, Hiatt H (1991) The nature of adverse events in hospitalized patients: Results of the Harvard medical practice study II. New England J. Medicine 324(6):377–384.
  • Linder JA, Doctor JN, Friedberg MW, Nieva HR, Birks C, Meeker D, Fox CR (2014) Time of day and the decision to prescribe antibiotics. JAMA Internal Medicine 174(12):2029–2031.
  • Milkovich GT, Newman JM, Gerhart B (2014) Compensation (McGraw-Hill, New York).
  • Nickerson RS (1998) Confirmation bias: A ubiquitous phenomenon in many guises. Rev. General Psych. 2(2):175–220.
  • Norman GR, Rosenthal D, Brooks LR, Allen SW, Muzzin LJ (1989) The development of expertise in dermatology. Arch. Dermatology 125(8):1063–1068.
  • Rodrigues F, Pereira F, Ribeiro B (2013) Learning from multiple annotators: Distinguishing good from random labelers. Pattern Recognition Lett. 34(12):1428–1436.
  • Saar-Tsechansky M, Provost F (2004) Active sampling for class probability estimation and ranking. Machine Learn. 54(2):153–178.
  • Saber Tehrani AS, Lee H, Mathews SC, Shore A, Makary MA, Pronovost PJ, Newman-Toker DE (2013) 25-year summary of US malpractice claims for diagnostic errors 1986–2010: An analysis from the national practitioner data bank. BMJ Quality Safety 22(8):672–680.
  • Schein AI, Popescul A, Ungar LH, Pennock DM (2002) Methods and metrics for cold-start recommendations. Proc. 25th Annual Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval (SIGIR ’02) (Association for Computing Machinery, New York), 253–260.
  • Settles B (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin, Madison.
  • Shanteau J (1992) Competence in experts: The role of task characteristics. Organ. Behav. Human Decision Processes 53(2):252–266.
  • Shanteau J, Weiss DJ, Thomas RP, Pounds JC (2002) Performance-based assessment of expertise: How to decide if someone is an expert or not. Eur. J. Oper. Res. 136(2):253–263.
  • Sheng V, Zhang J, Gu B, Wu X (2017) Majority voting and pairing with multiple noisy labeling. IEEE Trans. Knowledge Data Engrg. 31(7):1355–1368.
  • Singh H, Meyer AND, Thomas EJ (2014) The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality Safety 23(9):727–731.
  • Smith-Coggins R, Rosekind MR, Hurd S, Buccino KR (1994) Relationship of day vs. night sleep to physician performance and mood. Ann. Emergency Medicine 24(5):928–934.
  • Sussman EJ, Tsiaras WG, Soper KA (1982) Diagnosis of diabetic eye disease. JAMA 247(23):3231–3234.
  • Tanno R, Saeedi A, Sankaranarayanan S, Alexander DC, Silberman N (2019) Learning from noisy labels by regularized estimation of annotator confusion. 2019 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR) (IEEE, Piscataway, NJ), 11236–11245.
  • Tetlock PE (2017) Expert Political Judgment: How Good Is It? How Can We Know? (Princeton University Press, Princeton, NJ).
  • Tversky A, Kahneman D (1973) Availability: A heuristic for judging frequency and probability. Cognitive Psych. 5(2):207–232.
  • Tversky A, Kahneman D (1974) Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science 185(4157):1124–1131.
  • Van Such M, Lohr R, Beckman T, Naessens JM (2017) Extent of diagnostic agreement among medical referrals. J. Evaluation Clinical Practice 23(4):870–874.
  • Wang J, Ipeirotis PG, Provost F (2017) Cost-effective quality assurance in crowd labeling. Inform. Systems Res. 28(1):137–158.
  • Wang Y, Yao Q, Kwok JT, Ni LM (2020) Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surveys 53(3):1–34.
  • Wang G, Li J, Hopp WJ, Fazzalari FL, Bolling SF (2019) Using patient-specific quality information to unlock hidden healthcare capabilities. Manufacturing Service Oper. Management 21(3):582–601.
  • Weekley JA, Gier JA (1989) Ceilings in the reliability and validity of performance ratings: The case of expert raters. Acad. Management J. 32(1):213–222.
  • White N, Reid F, Harris A, Harries P, Stone P (2016) A systematic review of predictions of survival in palliative care: How accurate are clinicians and who are the experts? PLoS One 11(8):e0161407.
  • Whitehill J, Wu T, Bergsma J, Movellan J, Ruvolo P (2009) Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Adv. Neural Inform. Processing Systems 22 - Proc. 2009 Conf., vol. 22 (Curran Associates Inc., Red Hook, NY), 2035–2043.
  • Zhang B, Chen Z, Albert PS (2012) Estimating diagnostic accuracy of raters without a gold standard by exploiting a group of experts. Biometrics 68(4):1294–1302.
  • Zhang H, Xiao Y, Zhao X, Tian Z, Zhang SY, Dong D (2022) Physicians’ knowledge on specific rare diseases and its associated factors: A national cross-sectional study from China. Orphanet J. Rare Diseases 17(1):1–13.
  • Zhou ZH (2018) A brief introduction to weakly supervised learning. National Sci. Rev. 5(1):44–53.
  • Zhou D, Platt JC, Basu S, Mao Y (2012) Learning from the wisdom of crowds by minimax entropy. Proc. 25th Internat. Conf. Neural Inform. Processing Systems - Volume 2 (NIPS’12) (Curran Associates Inc., Red Hook, NY), 2195–2203.
  • Zwaan L, Singh H (2015) The challenges in defining and measuring diagnostic error. Diagnosis (Berlin) 2(2):97–103.