Clustering and Representative Selection for High-Dimensional Data with Human-in-the-Loop

Sheng-Tao Yang
[email protected]
https://orcid.org/0000-0003-0027-9606
Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30339

Jye-Chyi Lu (Corresponding Author)
[email protected]
Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30339

Yu-Chung Tsao
[email protected]
https://orcid.org/0000-0001-5058-8728
Department of Industrial Management, National Taiwan University of Science and Technology, Taipei City 106, Taiwan


Published Online: 14 Mar 2025
https://doi.org/10.1287/ijds.2022.9014


INFORMS Journal on Data Science

Volume 4, Issue 2

April-June 2025

Pages iii-vi, 101-196, ii

Article Information


Received: May 23, 2022
Accepted: December 15, 2023
Published Online: March 14, 2025

Cite as

Sheng-Tao Yang, Jye-Chyi Lu, Yu-Chung Tsao (2025) Clustering and Representative Selection for High-Dimensional Data with Human-in-the-Loop. INFORMS Journal on Data Science 4(2):154-172.

https://doi.org/10.1287/ijds.2022.9014

Acknowledgments

The authors thank Dr. Cantor for providing the QEEG data and explaining the physical meanings of the signal variables. They also thank the anonymous reviewers, associate editors, and senior editor for their careful reading of the manuscript and their insightful comments and suggestions.
