SparseKmeans: Efficient K-means Clustering For Sparse Data
Nguyen Pham Dang, Khoi ; Lin, He Zhe ; Lin, Chihjen
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
We introduce SparseKmeans, the first Python package for fast K-means clustering on high-dimensional sparse data. Most existing K-means implementations, such as the one in scikit-learn, are optimized only for dense data and do not run efficiently on sparse inputs. In this work, we thoroughly investigate how to accelerate widely used K-means algorithms on sparse data via matrix operations. In particular, we propose a new design of Elkan's method that aggregates distance computations and reduces fragmented memory access. By analyzing the structure of key matrices and leveraging highly optimized sparse matrix libraries, SparseKmeans achieves up to a 9x speedup over scikit-learn. The package is available at https://github.com/cjlin1/sparsekmeans.
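To illustrate what "K-means on sparse data via matrix operations" can look like, the following is a minimal sketch of one Lloyd iteration written with SciPy sparse matrices. It is not the SparseKmeans API or the authors' implementation: the function names assign_clusters and update_centers are hypothetical, and the code assumes the input is a scipy.sparse CSR matrix while the centers are kept as a dense NumPy array. The assignment step relies on the identity ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, so the only operation that touches the data is a sparse-times-dense product.

```python
import numpy as np
import scipy.sparse as sp


def assign_clusters(X, centers):
    """One Lloyd assignment step for sparse X (n x d) and dense centers (k x d).

    Hypothetical helper for illustration only.
    """
    # Squared norms of the rows of X, computed on the nonzeros only.
    x_sq = np.asarray(X.multiply(X).sum(axis=1)).ravel()      # shape (n,)
    # Squared norms of the centers.
    c_sq = np.einsum("ij,ij->i", centers, centers)            # shape (k,)
    # Cross term: a single sparse-times-dense matrix product.
    cross = np.asarray(X @ centers.T)                         # shape (n, k)
    dist_sq = x_sq[:, None] - 2.0 * cross + c_sq[None, :]
    return dist_sq.argmin(axis=1)


def update_centers(X, labels, k):
    """Recompute centers as cluster means via a sparse indicator matrix.

    Hypothetical helper for illustration only.
    """
    n = X.shape[0]
    # indicator[j, i] = 1 when point i is assigned to cluster j.
    indicator = sp.csr_matrix(
        (np.ones(n), (labels, np.arange(n))), shape=(k, n)
    )
    counts = np.asarray(indicator.sum(axis=1)).ravel()
    sums = (indicator @ X).toarray()                          # shape (k, d)
    # Empty clusters keep a zero sum; dividing by 1 leaves them at the origin.
    counts[counts == 0] = 1.0
    return sums / counts[:, None]


# Toy data: two points near (1, 0, 0) and two near (0, 0, 2).
X = sp.csr_matrix(
    np.array([[1.0, 0, 0], [0.9, 0, 0], [0, 0, 2.0], [0, 0, 2.2]])
)
centers = np.array([[1.0, 0, 0], [0, 0, 2.0]])
labels = assign_clusters(X, centers)
new_centers = update_centers(X, labels, 2)
```

In this sketch the data matrix is never converted to dense form; only the n x k distance matrix and the k x d center matrix are dense, which is the usual reason such a formulation suits high-dimensional sparse inputs. How SparseKmeans handles Elkan's method, empty clusters, or initialization is not covered here.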
Citation
K. Nguyen Pham Dang, H. Z. Lin, and C. J. Lin, “SparseKmeans: Efficient K-means Clustering For Sparse Data,” CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 6481–6485, Nov. 2025, doi: 10.1145/3746252.3761646
Source
Proceedings of the 34th ACM International Conference on Information and Knowledge Management
Conference
34th ACM International Conference on Information and Knowledge Management, CIKM 2025
Keywords
clustering, elkan's method, k-means algorithm, lloyd's method, sparse matrix operations, unsupervised learning
Publisher
Association for Computing Machinery
