An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance
Supplemental Material
Software and Data: ijds.2024.0056.cd.zip
Description of Software and Data
The code and data in the zip file referenced above are a snapshot of the software and data that were used in the research reported in the paper "An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance" by Mohammadmahdi Ghasemloo and David J. Eckman. This repository is also available via GitHub.
The goal of this repository is to replicate the numerical experiments in the paper.
Computer and Software Environment
The code was executed using Jupyter Notebook on a Windows 64-bit system equipped with a 12th Gen Intel(R) Core(TM) i7-1255U processor (1.70 GHz) and 16 GB of RAM.
Dependencies
Installation
Installing Jupyter Notebook is sufficient to open and run all the files.
Reproducibility Workflow
To reproduce the results in Figure 1
- Data File: regularized.txt and unregularized.txt
- Code Folder: Regularized and unregularized Wasserstein distance
- Code File: Draw Figures.ipynb
- Output: The plot in Figure 1
- Run Time at the Above-Specified Computer Conditions: ~10 minutes
To reproduce the results in Figure 2
- Data File: Generates the tuples_list_#.txt files
- Code Folder: Agglomerative vs Kmeans
- Code File: Data Generation.ipynb
- Output: Generates the data used in the analysis
- Run Time at the Above-Specified Computer Conditions: A few seconds
- Data File: Tuple_list_#.txt
- Code Folder: Agglomerative vs Kmeans
- Code File: Agglomerative_comparison.ipynb
- Output: Agglomerative_support.txt and Agglomerative_systems.txt
- Run Time at the Above-Specified Computer Conditions: ~2 minutes
- Data File: Tuple_list_#.txt
- Code Folder: Agglomerative vs Kmeans
- Code File: K_means.ipynb
- Output: Kmeans_support.txt and Kmeans _systems.txt
- Run Time at the Above-Specified Computer Conditions: ~30 minutes
- Data File: Agglomerative_support.txt, Agglomerative_systems.txt, Kmeans_support.txt, and Kmeans _systems.txt
- Code Folder: Agglomerative vs Kmeans
- Code File: Draw Figures.ipynb
- Output: Plots in Figure 2
- Run Time at the Above-Specified Computer Conditions: A few seconds
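The four Figure 2 steps above can also be run in sequence without opening each notebook by hand. The following is a minimal sketch, not part of the repository: it assumes the code folder sits directly under the repository root and that the `jupyter nbconvert` command is available on the PATH. The helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: the code folder sits directly under the repository root.
CODE_FOLDER = Path("Agglomerative vs Kmeans")

# Notebook order taken from the Figure 2 steps listed above.
NOTEBOOKS = [
    "Data Generation.ipynb",
    "Agglomerative_comparison.ipynb",
    "K_means.ipynb",
    "Draw Figures.ipynb",
]


def build_commands(folder=CODE_FOLDER, notebooks=NOTEBOOKS):
    """Return one `jupyter nbconvert` command per notebook, in order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",   # run every cell
            "--inplace",   # write the executed notebook back to the same file
            # Disable the per-cell timeout; K_means.ipynb runs for ~30 minutes.
            "--ExecutePreprocessor.timeout=-1",
            str(Path(folder) / name),
        ]
        for name in notebooks
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        print("Running:", cmd[-1])
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` from the repository root would execute the notebooks in the listed order. Because the simulation notebooks overwrite their output files, it may be worth checking the data_backup folder before running.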
To reproduce the results in Figure 5
- Data File: Eps_updated.csv, analyst_profile.csv, ibes.csv
- Code File: IJDS-ResolveConflictsInCrowds-20251002-Final.ipynb
- Output: The plot in Figure 5, in one PDF file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
To reproduce the results in Table 4
- Data File: weight_0721.csv, weight_only_0720.csv
- Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
- Output: The values stated in Table 4, in a tex file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
To reproduce the results in Table 5
- Data File: weight_0721.csv, weight_only_0720.csv
- Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
- Output: The values stated in Table 5, in a tex file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
To reproduce the results in Table 6
- Data File: weight_0721.csv, weight_only_0720.csv
- Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
- Output: The values stated in Table 6, in a tex file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
To reproduce the results in Table 7
- Data File: weight_0721.csv, weight_only_0720.csv
- Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
- Output: The values stated in Table 7, in a tex file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
To reproduce the results in Table 8
- Data File: weight_0721.csv, weight_only_0720.csv
- Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
- Output: The values stated in Table 8, in a tex file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
To reproduce the results in Table 9
- Data File: weight_0721.csv, weight_only_0720.csv
- Code File: IJDS-ResolveConflictsCrowds-20251002-Final.R
- Output: The values stated in Table 9, in a tex file
- Run Time at the Above-Specified Computer Conditions: 10 seconds
Note
All data files are in the data folder. Running the simulation code overwrites the simulated results stored there. The code saves figures to the “results” folder. The data used to produce our results is also stored in the data_backup folder, so that it is preserved if the files in the data folder are overwritten.
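If the data folder has been overwritten, the original files can be copied back from data_backup. The sketch below is one way to do this in Python and is not part of the repository: it assumes both folders sit directly under the repository root, and the function name `restore_data` is hypothetical.

```python
import shutil
from pathlib import Path


def restore_data(repo_root=".", backup_name="data_backup", data_name="data"):
    """Copy every file from the backup folder over the data folder.

    Assumption: both folders live directly under `repo_root`.
    """
    root = Path(repo_root)
    backup = root / backup_name
    target = root / data_name
    if not backup.is_dir():
        raise FileNotFoundError(f"Backup folder not found: {backup}")
    # dirs_exist_ok=True (Python 3.8+) lets copytree write into an existing
    # folder; files with the same name are overwritten, and files that exist
    # only in the data folder are left untouched.
    shutil.copytree(backup, target, dirs_exist_ok=True)
    return sorted(p.name for p in target.iterdir())
```

Calling `restore_data()` from the repository root returns the names of the entries in the data folder after the copy.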
Ongoing Development
A Python package named distclust has been developed to perform agglomerative clustering on empirical multivariate distributions and to support further analysis. More information about this package can be found on GitHub.
Cite
To cite the contents of this repository, please cite both the paper and this repository using their respective DOIs.
Article: https://doi.org/10.1287/ijds.2024.0056
Software and Data Repository: https://doi.org/10.1287/ijds.2024.0056.cd
License
Copyright (c) 2025 Ghasemloo and Eckman
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

