On the Power of Linear Programming for K-Means Clustering
Abstract
De Rosa and Khajavirad [De Rosa A, Khajavirad A (2022) The ratio-cut polytope and K-means clustering. SIAM J. Optim. 32(1):173–203] introduced a new polynomial-size linear programming (LP) relaxation for K-means clustering. In this paper, we further investigate both theoretical and computational properties of this relaxation. As evident from our numerical experiments with both synthetic and real-world data sets, the proposed LP relaxation is almost always tight, i.e., its optimal solution is feasible for the original nonconvex problem. To better understand this unexpected behavior, on the theoretical side, we focus on K-means clustering with two clusters, and we obtain sufficient conditions under which a given partition of data is an optimal solution of the LP relaxation. We further analyze the sufficient conditions when the input is generated according to a popular stochastic model and obtain recovery guarantees for the LP. We conclude our theoretical study by constructing a family of inputs for which the LP relaxation is never tight. Denoting by n the number of data points to be clustered, the LP relaxation contains inequalities, making it impractical for large data sets. To address the scalability issue, by building upon a cutting-plane algorithm together with the GPU implementation of PDLP, a first-order method LP solver, we develop an efficient algorithm that solves the proposed LP and hence, the K-means clustering problem for up to data points.
Funding: The authors were partially funded by the Air Force Office of Scientific Research [Grant FA9550-23-1-0123].

