Amorim E, Cançado M, Veloso A (2018) Automated essay scoring in the presence of biased ratings. Proc. 2018 Conf. North Amer. Chapter Assoc. Comput. Linguistics Human Language Tech. Vol. 1 (Long Papers) (Association for Computational Linguistics, Stroudsburg, PA), 229–237.
Arjovsky M, Bottou L, Gulrajani I, Lopez-Paz D (2019) Invariant risk minimization. Preprint, submitted July 5, https://arxiv.org/abs/1907.02893.
Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis (Springer, New York).
Bandi P, Geessink O, Manson Q, Van Dijk M, Balkenhol M, Hermsen M, Ehteshami Bejnordi B, et al. (2018) From detection of individual metastases to classification of lymph node status at the patient level: The CAMELYON17 challenge. IEEE Trans. Medical Imaging 38(2):550–560.
Banerjee A, Karlan D, Zinman J (2015) Six randomized evaluations of microcredit: Introduction and further steps. Amer. Econom. J. Appl. Econom. 7(1):1–21.
Beery S, Cole E, Gjoka A (2020) The iWildCam 2020 competition dataset. Preprint, submitted April 21, https://arxiv.org/abs/2004.10340.
Ben-David S, Blitzer J, Crammer K, Pereira F (2007) Analysis of representations for domain adaptation. Adv. Neural Inform. Processing Systems 20:137–144.
Ben-Tal A, den Hertog D, Waegenaere AD, Melenberg B, Rennen G (2013) Robust solutions of optimization problems affected by uncertain probabilities. Management Sci. 59(2):341–357.
Bertsimas D, Gupta V, Kallus N (2018) Data-driven robust optimization. Math. Programming 167(2):235–292.
Bickel S, Brückner M, Scheffer T (2007) Discriminative learning for differing training and test distributions. Proc. 24th Internat. Conf. Machine Learn. (Association for Computing Machinery, New York).
Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA (1998) Efficient and Adaptive Estimation for Semiparametric Models (Springer, New York).
Blanchet J, Kang Y, Murthy K (2019) Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3):830–857.
Blanchet J, Kang Y, Zhang F, Murthy K (2017) Data-driven optimal transport cost selection for distributionally robust optimization. Preprint, submitted May 19, https://arxiv.org/abs/1705.07152.
Buolamwini J, Gebru T (2018) Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc. Conf. Fairness Accountability Transparency (PMLR, New York), 77–91.
Cai TT, Fonseca Y, Hou K, Namkoong H (2024) C-learner: Constrained learning for causal inference and semiparametric statistics. Preprint, submitted May 15, https://arxiv.org/abs/2405.09493.
Chattopadhyay A, Hase CH, Zubizarreta JR (2020) Balancing vs modeling approaches to weighting in practice. Statist. Medicine 39(24):3227–3254.
Chen MS, Lara PN, Dang JH, Paterniti DA, Kelly K (2014) Twenty years post-NIH revitalization act: Enhancing minority participation in clinical trials (EMPaCT): Laying the groundwork for improving minority clinical trial accrual: Renewing the case for enhancing minority participation in cancer clinical trials. Cancer 120:1091–1096.
Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M (2020) Ethical machine learning in health care. Preprint, submitted September 22, https://arxiv.org/abs/2009.10576.
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J (2018) Double/debiased machine learning for treatment and structural parameters. Econom. J. 21(1):C1–C68.
Christie G, Fendley N, Wilson J, Mukherjee R (2018) Functional map of the world. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 6172–6180.
Chung E, Romano JP (2013) Exact and asymptotically robust permutation tests. Ann. Statist. 41(2):484–507.
Colombo M, Gondzio J, Grothey A (2011) A warm-start approach for large-scale stochastic linear programs. Math. Programming 127(2):371–397.
Cruces G, Galiani S (2007) Fertility and female labor supply in Latin America: New causal evidence. Labour Econom. 14(3):565–573.
Crump RK, Hotz VJ, Imbens GW, Mitnik OA (2006) Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. NBER Working Paper No. 0330, National Bureau of Economic Research, Cambridge, MA.
Dehejia R, Pop-Eleches C, Samii C (2021) From local to global: External validity in a fertility natural experiment. J. Bus. Econom. Statist. 39(1):217–243.
Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3):595–612.
Ding F, Hardt M, Miller J, Schmidt L (2021) Retiring adult: New datasets for fair machine learning. Adv. Neural Inform. Processing Systems 34.
Duchi JC, Namkoong H (2021) Learning models with uniform performance via distributionally robust optimization. Ann. Statist. 49(3):1378–1406.
Duchi J, Hashimoto T, Namkoong H (2023) Distributionally robust losses for latent covariate mixtures. Oper. Res. 71(2):649–664.
Efron B (1982) The jackknife, the bootstrap and other resampling plans. CBMS-NSF Regional Conf. Ser. Appl. Math. (Society for Industrial and Applied Mathematics, Philadelphia).
Egami N, Hartman E (2023) Elements of external validity: Framework, design, and analysis. Amer. Political Sci. Rev. 117(3):1070–1088.
Eldar YC, Ben-Tal A, Nemirovski A (2004) Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. IEEE Trans. Signal Processing 52(8):2177–2188.
Elkan C (2001) The foundations of cost-sensitive learning. Internat. Joint Conf. Artificial Intelligence, vol. 17 (Lawrence Erlbaum Associates Ltd., Mahwah, NJ), 973–978.
Fisher A, Kennedy EH (2021) Visually communicating and teaching intuition for influence functions. Amer. Statistician 75(2):162–172.
Fogarty CB (2019) Studentized sensitivity analysis for the sample average treatment effect in paired observational studies. J. Amer. Statist. Assoc. 115(531):1518–1530.
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, March M, Lempitsky V (2016) Domain-adversarial training of neural networks. J. Machine Learn. Res. 17(59):1–35.
Gao R, Chen X, Kleywegt A (2017) Wasserstein distributional robustness and regularization in statistical learning. Preprint, submitted December 17, https://arxiv.org/abs/1712.06050.
Gertler P, Shah M, Alzua ML, Cameron L, Martinez S, Patil S (2015) How does health promotion work? Evidence from the dirty business of eliminating open defecation. NBER Working Paper No. 20997, National Bureau of Economic Research, Cambridge, MA.
Hand DJ (2006) Classifier technology and the illusion of progress. Statist. Sci. 21(1):1–14.
Hendrycks D, Basart S, Mu N, Kadavath S, Wang F, Dorundo E, Desai R, et al. (2021) The many faces of robustness: A critical analysis of out-of-distribution generalization. Proc. IEEE/CVF Internat. Conf. Comput. Vision, 8340–8349.
Homma T, Saltelli A (1996) Importance measures in global sensitivity analysis of nonlinear models. Reliability Engrg. System Safety 52(1):1–17.
Huang J, Gretton A, Borgwardt KM, Schölkopf B, Smola AJ (2007) Correcting sample selection bias by unlabeled data. Adv. Neural Inform. Processing Systems 20:601–608.
Huyen C (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications (O’Reilly, Beijing).
Imai K, Ratkovic M (2014) Covariate balancing propensity score. J. Roy. Statist. Soc. Ser. B Statist. Methodology 76(1):243–263.
Imbens G, Rubin D (2015) Causal Inference for Statistics, Social, and Biomedical Sciences (Cambridge University Press, Cambridge, UK).
Jeong S, Namkoong H (2020) Assessing external validity over worst-case subpopulations. Preprint, submitted July 5, https://arxiv.org/abs/2007.02411.
Kallus N, Zhou A (2018) Confounding-robust policy improvement. Adv. Neural Inform. Processing Systems 31:9269–9279.
Kallus N, Mao X, Zhou A (2019) Interval estimation of individual-level causal effects under unobserved confounding. Proc. 22nd Internat. Conf. Artificial Intelligence Statist. (PMLR, New York).
Kang JDY, Schafer JL (2007) Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22(4):523–539.
Kennedy EH (2022) Semiparametric doubly robust targeted double machine learning: A review. Preprint, submitted March 12, https://arxiv.org/abs/2203.06469.
Kern HL, Stuart EA, Hill J, Green DP (2016) Assessing methods for generalizing experimental impact estimates to target populations. J. Res. Educational Effectiveness 9(1):103–127.
Koh PW, Sagawa S, Marklund H, Xie SM, Zhang M, Balsubramani A, Hu W, et al. (2020) Wilds: A benchmark of in-the-wild distribution shifts. Preprint, submitted December 14, https://arxiv.org/abs/2012.07421.
Kuhn D, Esfahani PM, Nguyen VA, Shafieezadeh-Abadeh S (2019) Wasserstein distributionally robust optimization: Theory and applications in machine learning. Oper. Res. Management Sci. Age Analytics 2019(October):130–166.
Lam H (2019) Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Oper. Res. 67(4):1090–1105.
Lam H, Zhou E (2015) Quantifying input uncertainty in stochastic optimization. Proc. 2015 Winter Simulation Conf. (IEEE, Piscataway, NJ).
Lazaridou A, Kuncoro A, Gribovskaya E, Agrawal D, Liska A, Terzi T, Gimenez M, et al. (2021) Mind the gap: Assessing temporal generalization in neural language models. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY).
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genetics 11(10):733–739.
Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG, Cole SR (2017) Generalizing study results: A potential outcomes perspective. Epidemiology 28(4):553–561.
Li F, Morgan KL, Zaslavsky AM (2018) Balancing covariates via propensity score weighting. J. Amer. Statist. Assoc. 113(521):390–400.
Lipton Z, Wang YX, Smola A (2018) Detecting and correcting for label shift with black box predictors. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York).
Meinshausen N, Bühlmann P (2015) Maximin effects in inhomogeneous large-scale data. Ann. Statist. 43(4):1801–1830.
Miller J, Krauth K, Recht B, Schmidt L (2020) The effect of natural distribution shift on question answering models. Internat. Conf. Machine Learn. (PMLR, New York), 6905–6916.
Miller J, Taori R, Raghunathan A, Sagawa S, Koh PW, Shankar V, Liang P, Carmon Y, Schmidt L (2021) Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. Proc. 38th Internat. Conf. Machine Learn. (PMLR, New York).
Miner MA (1945) Cumulative damage in fatigue. J. Appl. Mechanics 12(3):A159–A164.
Miyato T, Maeda S, Koyama M, Nakae K, Ishii S (2015) Distributional smoothing with virtual adversarial training. Preprint, submitted July 2, https://arxiv.org/abs/1507.00677.
Newey WK (1990) Semiparametric efficiency bounds. J. Appl. Econometrics 5(2):99–135.
Newey WK (1994) The asymptotic variance of semiparametric estimators. Econometrica 62(6):1349–1382.
Newey WK (1997) Convergence rates and asymptotic normality for series estimators. J. Econom. 79(1):147–168.
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. Engle RF, McFadden DL, eds. Handbook of Econometrics (Elsevier, Amsterdam), 2111–2245.
Owen AB (2014) Sobol’ indices and Shapley value. SIAM/ASA J. Uncertainty Quantification 2(1):245–251.
Owen AB (2015) Monte Carlo theory, methods, and examples. Accessed September 9, 2025, https://artowen.su.domains/mc/.
Paulk M, Curtis B, Chrissis M, Weber C (1993) Capability maturity model, version 1.1. IEEE Software 10(4):18–27.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, et al. (2011) Scikit-learn: Machine learning in Python. J. Machine Learn. Res. 12:2825–2830.
Peters J, Bühlmann P, Meinshausen N (2016) Causal inference by using invariant prediction: Identification and confidence intervals. J. Roy. Statist. Soc. Ser. B Statist. Methodology 78(5):947–1012.
Praestgaard J, Wellner JA (1993) Exchangeably weighted bootstraps of the general empirical process. Ann. Probab. 21(4):2053–2086.
Pyzdek T, Keller PA (2003) Quality Engineering Handbook (CRC Press, Boca Raton, FL).
Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND (2008) Dataset Shift in Machine Learning (MIT Press, Cambridge, MA).
Recht B, Roelofs R, Schmidt L, Shankar V (2019) Do ImageNet classifiers generalize to ImageNet? Proc. 36th Internat. Conf. Machine Learn. (PMLR, New York).
Robins J, Sued M, Lei-Gomez Q, Rotnitzky A (2007) Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statist. Sci. 22(4):544–559.
Rosenbaum PR (2010) Design of Observational Studies, Springer Series in Statistics (Springer, Cham, Switzerland).
Rosenbaum PR (2011) A new u-statistic with superior design sensitivity in matched observational studies. Biometrics 67(3):1017–1027.
Rosenfeld E, Ravikumar P, Risteski A (2021) The risks of invariant risk minimization. Proc. Ninth Internat. Conf. Learn. Representations.
Rothenhäusler D, Bühlmann P, Meinshausen N, Peters J (2018) Anchor regression: Heterogeneous data meets causality. Preprint, submitted January 18, https://arxiv.org/abs/1801.06229.
Saenko K, Kulis B, Fritz M, Darrell T (2010) Adapting visual category models to new domains. Proc. Eur. Conf. Comput. Vision (Springer, Berlin, Heidelberg), 213–226.
Sagawa S, Koh PW, Lee T, Gao I, Xie SM, Shen K, Kumar A, et al. (2021) Extending the wilds benchmark for unsupervised adaptation. Adv. Neural Inform. Processing Systems 21.
Sahoo R, Lei L, Wager S (2022) Learning from a biased sample. Preprint, submitted September 5, https://arxiv.org/abs/2209.01754.
Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S (2008) Global Sensitivity Analysis: The Primer (John Wiley & Sons, Hoboken, NJ).
Schölkopf B, Janzing D, Peters J, Sgouritsa E, Zhang K, Mooij J (2012) On causal and anticausal learning. Proc. 29th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1255–1262.
Shafieezadeh-Abadeh S, Esfahani PM, Kuhn D (2015) Distributionally robust logistic regression. Adv. Neural Inform. Processing Systems 28:1576–1584.
Shankar V, Dave A, Roelofs R, Ramanan D, Recht B, Schmidt L (2019) Do image classifiers generalize across time? Preprint, submitted June 5, https://arxiv.org/abs/1906.02168.
Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Statist. Planning Inference 90(2):227–244.
Song E, Nelson BL, Staum J (2016) Shapley effects for global sensitivity analysis: Theory and computation. SIAM/ASA J. Uncertainty Quantification 4(1):1060–1083.
Staib M, Jegelka S (2019) Distributionally robust optimization and generalization in kernel methods. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 9131–9141.
Stone CJ (1982) Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10(4):1040–1053.
Stuart EA, Cole SR, Bradshaw CP, Leaf PJ (2011) The use of propensity scores to assess the generalizability of results from randomized trials. J. Roy. Statist. Soc. Ser. A Statist. Soc. 174(2):369–386.
Sugiyama M, Krauledat M, Müller KR (2007) Covariate shift adaptation by importance weighted cross validation. J. Machine Learn. Res. 8:985–1005.
Taori R, Dave A, Shankar V, Carlini N, Recht B, Schmidt L (2020) Measuring robustness to natural distribution shifts in image classification. Adv. Neural Inform. Processing Systems 20.
Tipton E (2013) Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. J. Educational Behavioral Statist. 38(3):239–266.
Tipton E, Olsen RB (2018) A review of statistical methods for generalizing from evaluations of educational interventions. Educational Researcher 47(8):516–524.
Tipton E, Peck LR (2017) A design-based approach to improve external validity in welfare policy evaluations. Evaluation Rev. 41(4):326–356.
Tran D, Liu J, Dusenberry MW, Phan D, Collier M, Ren J, Han K, et al. (2022) Plex: Towards reliability using pretrained large model extensions. Preprint, submitted July 15, https://arxiv.org/abs/2207.07411.
Tsuboi Y, Kashima H, Hido S, Bickel S, Sugiyama M (2009) Direct density ratio estimation for large-scale covariate shift adaptation. J. Inform. Processing 17:138–155.
Tsybakov AB (2009) Introduction to Nonparametric Estimation (Springer, New York).
Van Parys BP, Esfahani PM, Kuhn D (2021) From data to decisions: Distributionally robust optimization is optimal. Management Sci. 67(6):3387–3402.
Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, DeTroyer-Cooley O, Pestrue J, et al. (2021) External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Internal Medicine 181(8):1065–1070.
Wortsman M, Ilharco G, Li M, Kim JW, Hajishirzi H, Farhadi A, Namkoong H, Schmidt L (2021) Robust fine-tuning of zero-shot models. Preprint, submitted September 4, https://arxiv.org/abs/2109.01903.
Xu H, Caramanis C, Mannor S (2012) A distributional interpretation of robust optimization. Math. Oper. Res. 37(1):95–110.
Yadlowsky S (2022) On cross-fitting with plug-in estimators. Accessed September 9, 2025, https://www.syadlowsky.com/blog/semiparametric/2022/10/24/on-cross-fitting-with-plug-in-estimators.html.
Yadlowsky S, Fleming S, Shah N, Brunskill E, Wager S (2021) Evaluating treatment prioritization rules via rank-weighted average treatment effects. Preprint, submitted November 15, https://arxiv.org/abs/2111.07966.
Yadlowsky S, Namkoong H, Basu S, Duchi J, Tian L (2022) Bounds on the conditional and average treatment effect with unobserved confounding factors. Ann. Statist. 50(5):2587–2615.
Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK (2018) Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Medicine 15(11):e1002683.
Zhang H, Singh H, Ghassemi M, Joshi S (2022) “Why did the model fail?”: Attributing model performance changes to distribution shifts. Preprint, submitted October 19, https://arxiv.org/abs/2210.10769v1.
Zhao Q (2019) Covariate balancing propensity score by tailored loss functions. Ann. Statist. 47(2):965–993.

Volume 74, Issue 2

March-April 2026

Pages v-ix, 573-1152, iii-iv

Received: April 24, 2023
Accepted: September 30, 2025
Published Online: December 18, 2025

Cite as

Tiffany (Tianhui) Cai, Hongseok Namkoong, Steve Yadlowsky (2025) Diagnosing Model Performance Under Distribution Shift. Operations Research 74(2):898–916.

https://doi.org/10.1287/opre.2023.0217
