Clustering and Representative Selection for High-Dimensional Data with Human-in-the-Loop

Sheng-Tao Yang
https://orcid.org/0000-0003-0027-9606
Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30339

Jye-Chyi Lu (Corresponding Author)
Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30339

Yu-Chung Tsao
https://orcid.org/0000-0001-5058-8728
Department of Industrial Management, National Taiwan University of Science and Technology, Taipei City 106, Taiwan

Published Online: 14 Mar 2025
https://doi.org/10.1287/ijds.2022.9014

Abstract

This article proposes a novel decision-making procedure, human-in-the-loop clustering and representative selection (HITL-CARS), that incorporates users’ domain knowledge into the analysis of high-dimensional data sets. The proposed method simultaneously clusters strongly correlated variables and estimates a linear regression model using only a few variables selected from the cluster representatives and the independent variables. We formulate the CARS procedure as a mixed-integer programming problem based on penalized likelihood and partitioning around medoids clustering. After users review the CARS results and provide advice based on their domain knowledge, HITL-CARS refines the analysis to account for their inputs. Simulation studies show that the one-stage CARS performs better than the two-stage group Lasso and clustering representative Lasso on metrics such as true positives, false positives, and exchangeable representative selection. Additionally, sensitivity and parameter misspecification studies demonstrate the robustness of CARS to different preset parameters and provide guidance on how to start and adjust the HITL-CARS procedure. A real-life example using brain mapping data shows that HITL-CARS can aid in discovering important brain regions associated with depression symptoms and provide predictive analytics on cluster representatives.
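To make the comparison in the abstract concrete, the following is a minimal sketch of the two-stage idea that CARS is measured against: first group strongly correlated variables and pick a medoid-style representative per group, then fit a Lasso on the representatives only. It is not the authors' one-stage mixed-integer formulation. The function names (`cluster_by_correlation`, `medoid`, `lasso_cd`, `run_demo`), the greedy grouping rule, the 0.8 correlation threshold, the penalty value, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def cluster_by_correlation(cols, threshold=0.8):
    """Greedy grouping (an assumption, not PAM): a column joins the first
    cluster whose seed column it correlates with at or above the threshold."""
    clusters = []
    for j in range(len(cols)):
        for members in clusters:
            if abs(corr(cols[j], cols[members[0]])) >= threshold:
                members.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def medoid(cols, members):
    """Member with the largest total |correlation| to its own cluster,
    i.e. the smallest total dissimilarity 1 - |r|."""
    return max(members,
               key=lambda j: sum(abs(corr(cols[j], cols[k])) for k in members))

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    m = sum(col) / n
    s = math.sqrt(sum((x - m) ** 2 for x in col) / n)
    return [(x - m) / s for x in col]

def lasso_cd(cols, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1,
    assuming standardized columns and a centered response."""
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)
    for _ in range(n_iter):
        for j, xj in enumerate(cols):
            # Correlation of x_j with the partial residual (x_j'x_j / n = 1).
            rho = sum(x * r for x, r in zip(xj, resid)) / n + beta[j]
            new = math.copysign(max(abs(rho) - lam, 0.0), rho)  # soft threshold
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, xj)]
                beta[j] = new
    return beta

def run_demo(seed=0, n=200, lam=0.3):
    """Synthetic example: columns 0-2 share one latent factor, columns 3-4
    share another, columns 5 and 6 are independent. Only the first factor
    and column 5 drive the response."""
    rng = random.Random(seed)
    z = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
    noisy = lambda base: [v + rng.gauss(0, 0.1) for v in base]
    cols = [noisy(z[0]), noisy(z[0]), noisy(z[0]),
            noisy(z[1]), noisy(z[1]), z[2], z[3]]
    y = [2.0 * a + 1.5 * c + rng.gauss(0, 0.5) for a, c in zip(z[0], z[2])]
    y_mean = sum(y) / n
    y = [v - y_mean for v in y]

    clusters = cluster_by_correlation(cols)             # stage 1: group variables
    reps = [medoid(cols, c) for c in clusters]          # one representative each
    beta = lasso_cd([standardize(cols[j]) for j in reps], y, lam)  # stage 2
    selected = [j for j, b in zip(reps, beta) if b != 0.0]
    return clusters, reps, selected

if __name__ == "__main__":
    clusters, reps, selected = run_demo()
    print("clusters:", clusters)
    print("representatives:", reps)
    print("selected by Lasso:", selected)
```

Because the two stages run separately in this sketch, the choice of representative ignores the response entirely. The abstract's one-stage formulation couples clustering and regression instead, which is the difference its simulation studies evaluate.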

History: Olivia Sheng served as the senior editor for this article.

Funding: S.-T. Yang and J.-C. Lu were partially supported by Lu’s 2023-24 Jim Pope Fellowship through The James G. and Dee H. Pope Faculty Fellows Endowment Fund at Georgia Institute of Technology.

Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.0310071.v1 and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2022.9014).

INFORMS Journal on Data Science

Volume 4, Issue 2

April-June 2025

Pages iii-vi, 101-196, ii

Information

Received: May 23, 2022
Accepted: December 15, 2023
Published Online: March 14, 2025

Cite as

Sheng-Tao Yang, Jye-Chyi Lu, Yu-Chung Tsao (2025) Clustering and Representative Selection for High-Dimensional Data with Human-in-the-Loop. INFORMS Journal on Data Science 4(2):154-172.

https://doi.org/10.1287/ijds.2022.9014

Acknowledgments

The authors thank Dr. Cantor for providing the QEEG data and explaining the physical meanings of the signal variables. They also thank the anonymous reviewers, associate editors, and senior editor for their careful reading of the manuscript and their insightful comments and suggestions.
