Data Exploration by Representative Region Selection: Axioms and Convergence

Alexander S. Estes
Alexander S. Estes
[email protected]
https://orcid.org/0000-0003-4008-4627
Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, Minnesota 55455;
Search for more papers by this author
,
Michael O. Ball
Corresponding Author
Michael O. Ball
[email protected]
https://orcid.org/0000-0003-2757-8569
Robert H. Smith School of Business and Institute of Systems Research, University of Maryland, College Park, Maryland 20742;
Search for more papers by this author
,
David J. Lovell
Corresponding Author
David J. Lovell
[email protected]
https://orcid.org/0000-0001-8480-1910
Department of Civil and Environmental Engineering and Institute of Systems Research, University of Maryland, College Park, Maryland 20742
Search for more papers by this author

Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, Minnesota 55455;

Corresponding Author

Michael O. Ball

Robert H. Smith School of Business and Institute of Systems Research, University of Maryland, College Park, Maryland 20742;

Search for more papers by this author

David J. Lovell

Corresponding Author

David J. Lovell

[email protected]

https://orcid.org/0000-0001-8480-1910

Department of Civil and Environmental Engineering and Institute of Systems Research, University of Maryland, College Park, Maryland 20742

Search for more papers by this author

Published Online:18 Mar 2021https://doi.org/10.1287/moor.2020.1115

References

[1] Agarwal PK , Har-Peled S , Varadarajan KR (2005) Geometric approximation via coresets. Goodman JE , Pach J , Welzl E , eds. Combinatorial and Computational Geometry, vol. 52 (Cambridge University Press, Cambridge, UK), 1–30.Google Scholar
[2] Bādoiu M , Har-Peled S , Indyk P (2002) Approximate clustering via core-sets. Vitter JS, Spirakis P, Yannakakis M, eds. Proc. 34th Annual ACM Sympos. Theory Comput. (Association for Computing Machinery, New York), 250–257. Google Scholar
[3] Ben-David S , Ackerman M (2009) Measures of clustering quality: A working set of axioms for clustering . Koller D , Schuurmans D , Bengio Y , Bottou L , eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Inc., Red Hook, NY), 121–128.Google Scholar
[4] Bezdek JC , Hathaway RJ (2002) VAT: A tool for visual assessment of (cluster) tendency. Giles CL, ed. Proc. 2002 Internat. Joint Conf. Neural Networks, vol. 3 (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 2225–2230.Google Scholar
[5] Biau G , Devroye L , Lugosi G (2008) Consistency of random forests and other averaging classifiers. J. Machine Learn. Res. 9:2015–2033.Google Scholar
[6] Borgelt C (2012) Frequent item set mining. WIREs Data Mining Knowledge Discovery 2(6):437–456.Google Scholar
[7] Burges CJC (2010) Dimension reduction: A guided tour. Found. Trends Machine Learn. 2(4):275–365.Google Scholar
[8] Burman P , Polonik W (2009) Multivariate mode hunting: Data analytic tools with measures of significance. J. Multivariate Anal. 100(6):1198–1218.Google Scholar
[9] Carlsson G , Mémoli F (2010) Characterization, stability and convergence of hierarchical clustering methods. J. Machine Learn. Res. 11:1425–1470.Google Scholar
[10] Caruso C , Colorni A , Aloi L (2003) Dominant, an algorithm for the p-center problem. Eur. J. Oper. Res. 149(1):53–64.Google Scholar
[11] Chen D , Chen R (2009) New relaxation-based algorithms for the optimal solution of the continuous and discrete p-center problems. Comput. Oper. Res. 36(5):1646–1655.Google Scholar
[12] Clark RD (1997) OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. J. Chemical Inform. Comput. Sci. 37(6):1181–1188.Google Scholar
[13] Daszykowski M , Walczak B , Massart D (2002) Representative subset selection. Analytica Chimica Acta 468(1):91–103.Google Scholar
[14] Davidović T , Ramljak D , Šelmić M , Teodorović D (2011) Bee colony optimization for the p-center problem. Comput. Oper. Res. 38(10):1367–1376.Google Scholar
[15] Elloumi S , Labbé M , Pochet Y (2004) A new formulation and resolution method for the p-center problem. INFORMS J. Comput. 16(1):84–94.Google Scholar
[16] Ester M , Kriegel HP , Sander J , Xu X (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. Simoudis E, Han J, Fayyad U, eds. Proc. Second Internat. Conf. Knowledge Discovery Data Mining (AAAI Press, Menlo Park, CA), 226–231.Google Scholar
[17] Estes A , Lovell DJ , Ball MO (2018) Unsupervised prototype reduction for data exploration and an application to air traffic management initiatives. EURO J. Transportation Logist. 8(5):467–510.Google Scholar
[18] Fahad A , Alshatri N , Tari Z , Alamri A , Khalil I , Zomaya AY , Foufou S , Bouras A (2014) A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Trans. Emerging Topics Comput. 2(3):267–279.Google Scholar
[19] Filippone M , Camastra F , Masulli F , Rovetta S (2008) A survey of kernel and spectral methods for clustering. Pattern Recognition 41(1):176–190.Google Scholar
[20] Fodor IK (2002) A survey of dimension reduction techniques. Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, Livermore, CA.Google Scholar
[21] Friedman JH , Fisher NI (1999) Bump hunting in high-dimensional data. Statist. Comput. 9(2):123–143.Google Scholar
[22] Gajek L (1986) On improving density estimators which are not bona fide functions. Ann. Statist. 14(4):1612–1618.Google Scholar
[23] Garcia S , Derrac J , Cano J , Herrera F (2012) Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Trans. Pattern Anal. Machine Intelligence 34(3):417–435.Google Scholar
[24] Garey MR , Johnson DS (1990) Computers and Intractability: A Guide to the Theory of NP-Completeness (W. H. Freeman & Co., New York).Google Scholar
[25] Glad IK , Hjort NL , Ushakov NG (2003) Correction of density estimators that are not densities. Scandinavian J. Statist. 30(2):415–427.Google Scholar
[26] Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. 38(Supplement C):293–306.Google Scholar
[27] Gorripaty S , Liu Y , Hansen M , Pozdnukhov A (2017) Identifying similar days for air traffic management. J. Air Transport Management 65(October):144–155.Google Scholar
[28] Halkidi M , Batistakis Y , Vazirgiannis M (2001) On clustering validation techniques. J. Intelligent Inform. Systems 17(2):107–145.Google Scholar
[29] Har-Peled S , Mazumdar S (2004) On coresets for k-means and k-median clustering. Kleinberg J, ed. Proc. 36th Annual ACM Sympos. Theory Comput. (Association for Computing Machinery, New York), 291–300.Google Scholar
[30] Hartigan JA (1981) Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc. 76(374):388–394.Google Scholar
[31] Hassin R , Levin A , Morad D (2003) Lexicographic local search and the p-center problem. Eur. J. Oper. Res. 151(2):265–279.Google Scholar
[32] Havens TC , Bezdek JC (2012) An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowledge Data Engrg. 24(5):813–822.Google Scholar
[33] Haynes TW , Hedetniemi S , Slater P (1998) Fundamentals of Domination in Graphs (CRC Press, New York).Google Scholar
[34] Hedetniemi S , Laskar R (1991) Bibliography on domination in graphs and some basic definitions of domination parameters. Hedetniemi ST, ed. Topics on Domination. Annals of Discrete Mathematics, vol. 48 (Elsevier, Amsterdam), 257–277.Google Scholar
[35] Ho CK , Singh YP , Ewe HT (2006) An enhanced ant colony optimization metaheuristic for the minimum dominating set problem. Applied Artificial Intelligence 20(10):881–903.Google Scholar
[36] Hochbaum DS , Shmoys DB (1985) A best possible heuristic for the k-center problem. Math. Oper. Res. 10(2):180–184.Google Scholar
[37] Huo X , Ni XS , Smith AK (2008) A survey of manifold-based learning methods. Liao TW, Triantaphyllou E, eds. Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications (World Scientific Publishing Co., Singapore), 691–746.Crossref, Google Scholar
[38] Ilhan T , Özsoy FA , Pinar M (2002) An efficient exact algorithm for the vertex p-center problem and computational experiments for different set covering subproblems. Technical Report 588, Department of Industrial Engineering, Bilkent University, Ankara, Turkey.Google Scholar
[39] Jones KS (2007) Automatic summarising: The state of the art. Inform. Processing Management 43(6):1449–1481.Google Scholar
[40] Kaluszka M (1998) On the Devroye-Györfi methods of correcting density estimators. Statist. Probab. Lett. 37(3):249–257.Google Scholar
[41] Kennard RW , Stone LA (1969) Computer aided design of experiments. Technometrics 11(1):137–148.Google Scholar
[42] Kleinberg J (2002) An impossibility theorem for clustering. Becker S, Thrun S, Obermayer K, eds. Proc. 15th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 463–470.Google Scholar
[43] Kosorok MR (2007) Introduction to Empirical Processes and Semiparametric Inference (Springer, New York).Google Scholar
[44] Kriegel HP , Kröger P , Zimek A (2009) Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery Data 3(1):1.Google Scholar
[45] Lloret E , Palomar M (2012) Text summarisation in progress: A literature review. Artifcial Intelligence Rev. 37(1):1–41.Google Scholar
[46] Mampaey M , Vreeken J (2013) Summarizing categorical data by clustering attributes. Data Mining Knowledge Discovery 26(1):130–173.Google Scholar
[47] Mampaey M , Vreeken J , Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans. Knowledge Discovery Data 6(4):16.Google Scholar
[48] Meinshausen N (2006) Quantile regression forests. J. Machine Learn. Res. 7:983–999.Google Scholar
[49] Mihelič J , Robič B (2003) Approximation algorithms for the k-center problem: An experimental evaluation. Leopold-Wildburger U, Rendl F, Wäscher G, eds. Internat. Conf. Oper. Res. 2002 (Springer, Berlin), 371–376.Google Scholar
[50] Minieka E (1970) The m-center problem. SIAM Rev. 12(1):138–139.Google Scholar
[51] Mladenović N , Labbé M , Hansen P (2003) Solving the p-center problem with tabu search and variable neighborhood search. Networks 42(1):48–64.Google Scholar
[52] Murtagh F , Contreras P (2017) Algorithms for hierarchical clustering: An overview, II. WIREs Data Mining Knowledge Discovery 7(6):e1219.Google Scholar
[53] Nenkova A , McKeown K (2011) Automatic summarization. Found. Trends Inform. Retrieval 5(23):103–233.Google Scholar
[54] Parekh AK (1991) Analysis of a greedy heuristic for finding small dominating sets in graphs. Inform. Processing Lett. 39(5):237–240.Google Scholar
[55] Pollard D (1981) Strong consistency of k-means clustering. Ann. Statist. 9(1):135–140.Google Scholar
[56] Puzicha J , Hofmann T , Buhmann JM (2000) A theory of proximity based clustering: structure detection by optimization. Pattern Recognition 33(4):617–634.Google Scholar
[57] Ram P , Gray AG (2011) Density estimation trees. Ghosh J, Smyth P, eds. Proc. 17th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 627–635.Google Scholar
[58] Robič B , Mihelič J (2005) Solving the k-center problem efficiently with a dominating set algorithm. J. Comput. Inform. Tech. 13(3):225–234.Crossref, Google Scholar
[59] Sanchis LA (2002) Experimental analysis of heuristic algorithms for the dominating set problem. Algorithmica 33(1):3–18.Google Scholar
[60] Schubert E , Sander J , Ester M , Kriegel HP , Xu X (2017) DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Systems 42(3):19:1–19:21.Google Scholar
[61] Scornet E , Biau G , Vert JP (2015) Consistency of random forests. Ann. Statist. 43(4):1716–1741.Google Scholar
[62] Scott DW (2015) Multivariate Density Estimation: Theory, Practice, and Visualization (John Wiley & Sons, New York, NY).Google Scholar
[63] Sheather SJ (2004) Density estimation. Statist. Sci. 19(4):588–597.Google Scholar
[64] Shorack GR , Wellner JA (2009) Empirical Processes with Applications to Statistics (Society for Industrial and Applied Mathematics, Philadelphia).Google Scholar
[65] Sun W , Wang J , Fang Y (2012) Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic J. Statist. 6:148–167.Google Scholar
[66] Triguero I , Derrac J , Garcia S , Herrera F (2012) A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Trans. Systems, Man, Cybernetics, Part C (Appl. Rev.) 42(1):86–100.Google Scholar
[67] van Laarhoven T , Marchiori E (2014) Axioms for graph clustering quality functions. J. Machine Learn. Res. 15(6):193–215.Google Scholar
[68] Vapnik VN , Chervonenkis AY (2013) On the uniform convergence of the frequencies of occurrence of events to their probabilities. Schölkopf B, Luo Z, Vovk V, eds. Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik (Springer, Berlin), 7–12.Crossref, Google Scholar
[69] von Luxburg U , Belkin M , Bousquet O (2008) Consistency of spectral clustering. Ann. Statist. 36(2):555–586.Google Scholar
[70] Wang L , Nguyen UTV , Bezdek JC , Leckie CA , Ramamohanarao K (2010) iVAT and aVAT: Enhanced visual analysis for cluster tendency assessment. Zaki MJ, Yu JX, Ravindran B, Pudi V, eds. Advances in Knowledge Discovery and Data Mining (Springer, Berlin), 16–27.Google Scholar
[71] Ward JH Jr (1963) Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc. 58(301):236–244.Google Scholar
[72] Xu R , Wunsch D (2005) Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3):645–678.Google Scholar
[73] Yan X , Cheng H , Han J , Xin D (2005) Summarizing itemset patterns: A profile-based approach. Bayardo R, Bennett K, eds. Proc. 11th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 314–323.Google Scholar

cover image Mathematics of Operations Research

Volume 46, Issue 3

August 2021

Pages 835-1234, C2

Article Information

Metrics

Information

Received:July 19, 2018
Accepted:August 07, 2020
Published Online:March 18, 2021

Cite as

Alexander S. Estes , Michael O. Ball , David J. Lovell (2021) Data Exploration by Representative Region Selection: Axioms and Convergence. Mathematics of Operations Research 46(3):970-1007.

https://doi.org/10.1287/moor.2020.1115

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Data Exploration by Representative Region Selection: Axioms and Convergence

References

Volume 46, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News