Diagnosing Model Performance Under Distribution Shift
Abstract
Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, which we call distribution shift decomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for (1) an increase in harder but frequently seen examples from training, (2) changes in the relationship between features and outcomes, and (3) poor performance on examples infrequent or unseen during training. Empirically, we demonstrate how our method can (1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data and (2) help to explain why certain domain adaptation methods fail to improve model performance for satellite image classification.
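The three-term decomposition can be illustrated on a toy problem with a finite feature space. The sketch below is not the authors' implementation. It makes several assumptions that the abstract does not state: that the terms are defined by telescoping through a "shared" feature distribution `s`, that `s(x)` is proportional to `p(x) q(x) / (p(x) + q(x))`, and that the model's expected loss at each feature value is known exactly under both the training and target outcome distributions. The function name `decompose_gap` and all numbers are hypothetical.

```python
def decompose_gap(p, q, loss_p, loss_q):
    """Split the target-minus-training loss gap into three telescoping terms.

    p, q           : dict mapping feature value -> probability under the
                     training / target distribution.
    loss_p, loss_q : dict mapping feature value -> expected loss of the fixed
                     model at that value under the training / target Y|X.
    """
    # Assumed shared distribution: mass only where both p and q have mass.
    raw = {x: p[x] * q[x] / (p[x] + q[x])
           for x in p if p[x] > 0 and q.get(x, 0) > 0}
    z = sum(raw.values())
    if z == 0:
        raise ValueError("training and target features do not overlap")
    s = {x: w / z for x, w in raw.items()}

    perf_p = sum(p[x] * loss_p[x] for x in p if p[x] > 0)    # train X, train Y|X
    perf_s_p = sum(s[x] * loss_p[x] for x in s)              # shared X, train Y|X
    perf_s_q = sum(s[x] * loss_q[x] for x in s)              # shared X, target Y|X
    perf_q = sum(q[x] * loss_q[x] for x in q if q[x] > 0)    # target X, target Y|X

    return {
        # (1) more weight on harder examples that training already covers
        "x_shift_seen": perf_s_p - perf_p,
        # (2) change in the feature-outcome relationship on shared features
        "y_given_x_shift": perf_s_q - perf_s_p,
        # (3) examples that are infrequent or unseen in training
        "x_shift_unseen": perf_q - perf_s_q,
        "total": perf_q - perf_p,
    }


# Hypothetical toy data: feature value "c" never appears in training.
p = {"a": 0.6, "b": 0.4, "c": 0.0}
q = {"a": 0.2, "b": 0.4, "c": 0.4}
loss_p = {"a": 0.1, "b": 0.3, "c": 0.0}   # value at "c" is unused since p["c"] == 0
loss_q = {"a": 0.1, "b": 0.5, "c": 0.6}

terms = decompose_gap(p, q, loss_p, loss_q)
for name, value in terms.items():
    print(f"{name}: {value:.4f}")
```

Because the terms telescope, they sum to the total gap for any choice of shared distribution. The particular choice of `s` only changes how the gap is split among them. With real data the per-feature losses and density ratios would have to be estimated rather than read from a table.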
Funding: T. (T.) Cai was supported by the National Science Foundation Graduate Research Fellowship [Grant DGE-2036197]. H. Namkoong was partially supported by the Amazon Research Award.
Supplemental Material: All supplemental materials, including the code, data, and files required to reproduce the results, are available at https://doi.org/10.1287/opre.2023.0217.

