An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance

Mohammadmahdi Ghasemloo (Corresponding Author)
https://orcid.org/0009-0005-2444-1956
Department of Industrial and Systems Engineering, Texas A&M University, College Station, Texas 77843

David J. Eckman
https://orcid.org/0000-0002-6473-6434
Department of Industrial and Systems Engineering, Texas A&M University, College Station, Texas 77843

Published Online: 17 Sep 2025
https://doi.org/10.1287/ijds.2024.0056

Supplemental Material

Software and Data: ijds.2024.0056.cd.zip

Description of Software and Data

The code and data in the zip file referenced above are a snapshot of the software and data that were used in the research reported in the paper "An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance" by Mohammadmahdi Ghasemloo and David J. Eckman. This repository is also available via GitHub.

The goal of this repository is to replicate the numerical experiments in the paper.

Computer and Software Environment

The code was executed using Jupyter Notebook on a Windows 64-bit system equipped with a 12th Gen Intel(R) Core(TM) i7-1255U processor (1.70 GHz) and 16 GB of RAM.

Dependencies

Installation

Installing Jupyter Notebook is sufficient to open and run all of the notebook files.

Reproducibility Workflow

To reproduce the results in Figure 1

Data File: regularized.txt and unregularized.txt
Code Folder: Regularized and unregularized Wasserstein distance
Code File: Draw Figures.ipynb
Output: The plot in Figure 1
Run Time at the Above-Specified Computer Conditions: ~10 minutes
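
For readers unfamiliar with the two quantities compared in this step, the following is a minimal sketch of how a regularized transport cost can differ from the unregularized one. It assumes entropic regularization solved by Sinkhorn iterations, which may not match the formulation used in the paper, and it is not the code in Draw Figures.ipynb. The function names, the toy distributions, and the value reg=1.0 are illustrative assumptions.

```python
import math

def sinkhorn_plan(a, b, cost, reg, n_iter=200):
    """Return an entropy-regularized transport plan between weight vectors a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel: a smaller reg concentrates mass on the cheapest routes.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))

# Toy example (hypothetical data): equal weights on supports {0, 1} and {1, 2}.
x, y = [0.0, 1.0], [1.0, 2.0]
a, b = [0.5, 0.5], [0.5, 0.5]
cost = [[(xi - yj) ** 2 for yj in y] for xi in x]  # squared Euclidean cost

plan = sinkhorn_plan(a, b, cost, reg=1.0)
regularized_cost = transport_cost(plan, cost)

# The unregularized optimum shifts each point right by one, for a cost of 1.0.
# With reg=1.0 the plan spreads mass over costlier routes, giving about 1.269.
print(round(regularized_cost, 3))
```

In this toy case the regularized cost exceeds the unregularized one because the entropy term spreads mass away from the cheapest assignment; shrinking reg narrows the gap at the price of slower convergence.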

To reproduce the results in Figure 2

Data File: None required (this step generates the tuples_list_#.txt files)
Code Folder: Agglomerative vs Kmeans
Code File: Data Generation.ipynb
Output: The data used in the analysis
Run Time at the Above-Specified Computer Conditions: A few seconds

Data File: Tuple_list_#.txt
Code Folder: Agglomerative vs Kmeans
Code File: Agglomerative_comparison.ipynb
Output: Agglomerative_support.txt and Agglomerative_systems.txt
Run Time at the Above-Specified Computer Conditions: ~2 minutes

Data File: Tuple_list_#.txt
Code Folder: Agglomerative vs Kmeans
Code File: K_means.ipynb
Output: Kmeans_support.txt and Kmeans_systems.txt
Run Time at the Above-Specified Computer Conditions: ~30 minutes

Data Files: Agglomerative_support.txt, Agglomerative_systems.txt, Kmeans_support.txt, and Kmeans_systems.txt
Code Folder: Agglomerative vs Kmeans
Code File: Draw Figures.ipynb
Output: Plots in Figure 2
Run Time at the Above-Specified Computer Conditions: A few seconds
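
The four Figure 2 steps above depend on each other's output, so order matters. The sketch below builds the commands to run the notebooks in that order without opening them by hand. It assumes the `jupyter nbconvert` command-line tool is installed and that the script is started from the directory containing the "Agglomerative vs Kmeans" folder. The helper names `build_commands` and `run_all` are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path

# Folder and notebook names as listed in the workflow above, in dependency order.
FOLDER = Path("Agglomerative vs Kmeans")
NOTEBOOKS = [
    "Data Generation.ipynb",           # writes the input data
    "Agglomerative_comparison.ipynb",  # writes the Agglomerative_*.txt files
    "K_means.ipynb",                   # writes the Kmeans_*.txt files
    "Draw Figures.ipynb",              # reads both result sets and plots Figure 2
]

def build_commands(folder=FOLDER, notebooks=NOTEBOOKS):
    """Return one nbconvert command per notebook, in execution order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         str(folder / name)]
        for name in notebooks
    ]

def run_all():
    """Execute every notebook in order and stop at the first failure."""
    for command in build_commands():
        subprocess.run(command, check=True)

# Usage: call run_all() from the repository root.
```

Passing each command as a list avoids shell quoting problems with the spaces in the folder and file names.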

To reproduce the results in Figure 5

Data File: Eps_updated.csv, analyst_profile.csv, ibes.csv
Code File: IJDS-ResolveConflictsInCrowds-20251002-Final.ipynb
Output: The plot in Figure 5, in one PDF file
Run Time at the Above-Specified Computer Conditions: 10 seconds

To reproduce the results in Table 4

Data File: weight_0721.csv, weight_only_0720.csv
Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
Output: The values stated in Table 4, in a tex file
Run Time at the Above-Specified Computer Conditions: 10 seconds

To reproduce the results in Table 5

Data File: weight_0721.csv, weight_only_0720.csv
Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
Output: The values stated in Table 5, in a tex file
Run Time at the Above-Specified Computer Conditions: 10 seconds

To reproduce the results in Table 6

Data File: weight_0721.csv, weight_only_0720.csv
Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
Output: The values stated in Table 6, in a tex file
Run Time at the Above-Specified Computer Conditions: 10 seconds

To reproduce the results in Table 7

Data File: weight_0721.csv, weight_only_0720.csv
Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
Output: The values stated in Table 7, in a tex file
Run Time at the Above-Specified Computer Conditions: 10 seconds

To reproduce the results in Table 8

Data File: weight_0721.csv, weight_only_0720.csv
Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
Output: The values stated in Table 8, in a tex file
Run Time at the Above-Specified Computer Conditions: 10 seconds

To reproduce the results in Table 9

Data File: weight_0721.csv, weight_only_0720.csv
Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
Output: The values stated in Table 9, in a tex file
Run Time at the Above-Specified Computer Conditions: 10 seconds

Note

All data files are in the data folder. Running the simulation code overwrites the simulated results stored there. The code saves the figures in the "results" folder. The data used to produce the results in the paper are also stored in the data_backup folder, so they are preserved if the files in the data folder are overwritten.
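
If the files in the data folder have been overwritten, the originals can be copied back from data_backup. The sketch below is one way to do that, assuming both folders sit in the current working directory and that Python 3.8 or later is available (needed for `dirs_exist_ok`). The function name `restore_data` is hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path

def restore_data(backup_dir="data_backup", data_dir="data"):
    """Copy every file from the backup folder over the working data folder."""
    backup = Path(backup_dir)
    if not backup.is_dir():
        raise FileNotFoundError(f"Backup folder not found: {backup}")
    # dirs_exist_ok=True lets copytree write into an existing data folder,
    # replacing files that share a name with a backed-up file.
    shutil.copytree(backup, data_dir, dirs_exist_ok=True)
    return sorted(p.name for p in Path(data_dir).iterdir())

# Usage: restore_data() from the folder that contains data and data_backup.
```

Files in the data folder that have no counterpart in data_backup are left untouched, so newly simulated output under a different name is not deleted.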

Ongoing Development

A Python package named distclust has been developed to perform agglomerative clustering on empirical multivariate distributions and to support further analysis. More information about this package can be found on GitHub.

Cite

To cite the contents of this repository, please cite both the paper and this repository using their respective DOIs.

Article: https://doi.org/10.1287/ijds.2024.0056
Software and Data Repository: https://doi.org/10.1287/ijds.2024.0056.cd

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

INFORMS Journal on Data Science

Volume 5, Issue 1

January-March 2026

Pages iii-iv, 1-80, ii

Information

Received: November 01, 2024
Accepted: July 31, 2025
Published Online: September 17, 2025

Cite as

Mohammadmahdi Ghasemloo, David J. Eckman (2025) An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance. INFORMS Journal on Data Science 5(1):65-80.

https://doi.org/10.1287/ijds.2024.0056

Keywords

Acknowledgments

The authors thank Morteza Davari for helpful discussions about the online monitoring application and thank the associate editor and reviewers for helpful comments that improved the paper. No data ethics considerations are foreseen related to this paper.
