Diagnosing Model Performance Under Distribution Shift

Published Online: https://doi.org/10.1287/opre.2023.0217
