An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance

Mohammadmahdi Ghasemloo (Corresponding Author)
[email protected]
https://orcid.org/0009-0005-2444-1956
Department of Industrial and Systems Engineering, Texas A&M University, College Station, Texas 77843

David J. Eckman
[email protected]
https://orcid.org/0000-0002-6473-6434
Department of Industrial and Systems Engineering, Texas A&M University, College Station, Texas 77843

Published Online: 17 Sep 2025
https://doi.org/10.1287/ijds.2024.0056

References

Abdallah I, Tatsis K, Chatzi E (2020) Unsupervised local cluster-weighted bootstrap aggregating the output from multiple stochastic simulators. Reliability Engrg. System Safety 199:106876.
Altschuler J, Niles-Weed J, Rigollet P (2017) Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Curran Associates, Inc., Red Hook, NY), 1964–1974.
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. Precup D, Teh YW, eds. Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 214–223.
Benamou J-D, Carlier G, Cuturi M, Nenna L, Peyré G (2015) Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 37(2):A1111–A1138.
Chakraborty S, Paul D, Das S (2020) Hierarchical clustering with optimal transport. Statist. Probab. Lett. 163:108781.
Cont R (2001) Empirical properties of asset returns: Stylized facts and statistical issues. Quant. Finance 1(2):223–236.
Cuturi M (2013) Sinkhorn distances: Lightspeed computation of optimal transport. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY), 2292–2300.
Cuturi M, Doucet A (2014) Fast computation of Wasserstein barycenters. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, New York), 685–693.
Del Barrio E, Cuesta-Albertos JA, Matrán C, Mayo-Íscar A (2019) Robust clustering tools based on optimal transportation. Statist. Comput. 29:139–160.
Donat MG, Alexander LV, Yang H, Durre I, Vose R, Caesar J (2013) Updated analyses of temperature and precipitation extreme indices since the beginning of the twentieth century: The HadEX2 dataset. J. Geophysical Res. Atmospheres 118(5):2098–2118.
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95(25):14863–14868.
Genevay A, Peyré G, Cuturi M (2018) Learning generative models with Sinkhorn divergences. Storkey A, Perez-Cruz F, eds. Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1608–1617.
Haneuse S, Wakefield J (2009) Adjusting for bias due to missing data in clinical studies using auxiliary data and empirical distributions. Biostatistics 10(2):245–257.
Henderson K, Gallagher B, Eliassi-Rad T (2015) EP-MEANS: An efficient nonparametric clustering of empirical probability distributions. Proc. 30th Annual ACM Sympos. Appl. Comput. (ACM, New York), 893–900.
Horvath B, Issa Z, Muguruza A (2021) Clustering market regimes using the Wasserstein distance. Preprint, submitted October 22, https://arxiv.org/abs/2110.11848.
Hubert L, Arabie P (1985) Comparing partitions. J. Classification 2:193–218.
Irpino A, Verde R, de AT Carvalho F (2014) Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems Appl. 41(7):3351–3366.
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognition Lett. 31(8):651–666.
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: A review. ACM Comput. Surveys 31(3):264–323.
James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical Learning: With Applications in R, vol. 112 (Springer, New York).
Karthikeyan B, George DJ, Manikandan G, Thomas T (2020) A comparative study on k-means clustering and agglomerative hierarchical clustering. Internat. J. Emerging Trends Engrg. Res. 8(5):1600–1604.
Kelton WD (2006) Implementing representations of uncertainty. Henderson SG, Nelson BL, eds. Simulation, Handbooks in Operations Research and Management Science, vol. 13 (Elsevier, Amsterdam), 181–191.
Li H, Lam H, Peng Y (2024) Efficient learning for clustering and optimizing context-dependent designs. Oper. Res. 72(2):617–638.
Liu Y, Zheng Y, Peng Y, Yuan W (2021) A framework of digital twin generation for structural dynamic monitoring of offshore platform. Ocean Engrg. 237:109599.
Loureiro A, Torgo L, Soares C (2004) Outlier detection using clustering methods: A data cleaning application. Proc. KDNet Sympos. Knowledge-Based Systems Public Sector (Springer, New York).
Montgomery DC (2009) Empirical distributions of process data. Introduction to Statistical Quality Control, 6th ed. (John Wiley & Sons, Hoboken, NJ).
Mur A, Dormido R, Duro N, Dormido-Canto S, Vega J (2016) Determination of the optimal number of clusters using a spectral clustering optimization. Expert Systems Appl. 65:304–314.
Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: An overview. Data Mining Knowledge Discovery 2(1):86–97.
Nelson BL (2016) Some tactical problems in digital simulation for the next 10 years. J. Simulation 10(1):2–11.
Pappas TN, Jayant NS (1989) An adaptive clustering algorithm for image segmentation. Internat. Conf. Acoustics Speech Signal Processing, vol. 3 (IEEE, Piscataway, NJ), 1667–1670.
Peng Y, Xu J, Lee LH, Hu J, Chen C-H (2018) Efficient simulation sampling allocation using multifidelity models. IEEE Trans. Automatic Control 64(8):3156–3169.
Peyré G, Cuturi M (2019) Computational optimal transport: With applications to data science. Foundations Trends Machine Learn. 11(5–6):355–607.
Riess L, Beiglböck M, Temme J, Wolf A, Backhoff J (2023) The geometry of financial institutions—Wasserstein clustering of financial data. Preprint, submitted May 5, https://arxiv.org/abs/2305.03565.
Rubner Y, Tomasi C, Guibas LJ (2000) The earth mover’s distance as a metric for image retrieval. Internat. J. Comput. Vision 40:99–121.
Santambrogio F, Wang X-J (2016) Convexity of the support of the displacement interpolation: Counterexamples. Appl. Math. Lett. 58:152–158.
Shahapure KR, Nicholas C (2020) Cluster quality analysis using silhouette score. IEEE Seventh Internat. Conf. Data Sci. Adv. Anal. (IEEE, Piscataway, NJ), 747–748.
Villani C (2009) Optimal Transport: Old and New, vol. 338 (Springer, Berlin).
Wedel M, Kamakura WA (2000) Market Segmentation: Conceptual and Methodological Foundations (Kluwer Academic Publishers, Boston).
Zhang Z, Peng Y (2024) Sample-efficient clustering and conquer procedures for parallel large-scale ranking and selection. Preprint, submitted February 3, https://arxiv.org/abs/2402.02196.
Zhuang Y, Chen X, Yang Y (2022) Wasserstein K-means for clustering probability distributions. Proc. 36th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 11382–11395.

INFORMS Journal on Data Science

Volume 5, Issue 1

January-March 2026

Pages iii-iv, 1-80, ii

Received: November 01, 2024
Accepted: July 31, 2025
Published Online: September 17, 2025

Cite as

Mohammadmahdi Ghasemloo, David J. Eckman (2025) An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance. INFORMS Journal on Data Science 5(1):65-80.

https://doi.org/10.1287/ijds.2024.0056

Acknowledgments

The authors thank Morteza Davari for helpful discussions about the online monitoring application and thank the associate editor and reviewers for helpful comments that improved the paper. No data ethics considerations are foreseen related to this paper.
