Diagnosing Model Performance Under Distribution Shift
Published Online: 18 Dec 2025
https://doi.org/10.1287/opre.2023.0217

