Optimized Scoring Systems: Toward Trust in Machine Learning for Healthcare and Criminal Justice

Cynthia Rudin (Corresponding Author)
http://orcid.org/0000-0003-4283-2780
Departments of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University, Durham, North Carolina 27708

Berk Ustun
Center for Research in Computation for Society, Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138

Published Online: 3 Oct 2018
https://doi.org/10.1287/inte.2018.0957

Volume 48, Issue 5

Special Issue: 2017 Daniel H. Wagner Prize for Excellence in Operations Research Practice

September-October 2018

Pages 399-486, C3

Cite as

Cynthia Rudin, Berk Ustun (2018) Optimized Scoring Systems: Toward Trust in Machine Learning for Healthcare and Criminal Justice. Interfaces 48(5):449-466.

https://doi.org/10.1287/inte.2018.0957
