Clustering and Representative Selection for High-Dimensional Data with Human-in-the-Loop
Abstract
This article proposes a novel decision-making procedure called human-in-the-loop clustering and representative selection (HITL-CARS) that involves users’ domain knowledge for analyzing high-dimensional data sets. The proposed method simultaneously clusters strongly correlated variables and estimates a linear regression model with only a few selected variables from cluster representatives and independent variables. In this work, we model the CARS procedure as a mixed-integer programming problem on the basis of penalized likelihood and partition around medoids clustering. After users obtain analysis results from CARS and provide their advice based on their domain knowledge, HITL-CARS refines analyses for accounting users’ inputs. Simulation studies show that the one-stage CARS performs better than the two-stage group Lasso and clustering representative Lasso in metrics such as true-positive, false-positive, exchangeable representative selection, and so on. Additionally, sensitivity and parameter misspecification studies present the robustness of the CARS to different preset parameters and provide guidance on how to start and adjust the HILT-CARS procedure. A real-life example of brain mapping data shows that HITL-CARS could aid in discovering important brain regions associated with depression symptoms and provide predictive analytics on cluster representatives.
History: Olivia Sheng served as the senior editor for this article.
Funding: S.-T. Yang and J.-C. Lu were partially supported by Lu’s 2023-24 Jim Pope Fellowship through The James G. and Dee H. Pope Faculty Fellows Endowment Fund at Georgia Institute of Technology.
Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.0310071.v1 and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2022.9014).

