Faster Kmeans Clustering on High-dimensional Data with GPU Support

We've been using Kmeans to cluster our logs. A typical dataset has 10 million samples with 100k+ features.

To find the optimal k, we run multiple Kmeans fits in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. Currently we are using scikit-learn Kmeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB of RAM.
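
For context, a minimal sketch of this kind of selection loop with scikit-learn and joblib (simplified; the k range and the silhouette sample size are just placeholders, not the values we actually use):

    from joblib import Parallel, delayed
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def fit_and_score(X, k):
        # Fit one KMeans model; the silhouette is computed on a subsample to
        # keep its pairwise-distance cost tractable.
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        return k, silhouette_score(X, labels, sample_size=10_000, random_state=0)

    def best_k(X, k_values=range(2, 101)):
        # Fit the candidate k values in parallel and keep the best silhouette.
        scores = Parallel(n_jobs=-1)(delayed(fit_and_score)(X, k) for k in k_values)
        return max(scores, key=lambda s: s[1])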

I am currently looking for a faster solution.

What I have already tested:

  1. Kmeans + Mean Shift combination - a little faster (for k=1024 --> ~13h), but still slow.

  2. Kmcuda library - no support for sparse matrix representations. It would require ~3 TB of RAM to hold that dataset as a dense matrix in memory (see the back-of-the-envelope check after this list).

  3. TensorFlow (tf.contrib.factorization.python.ops.KmeansClustering()) - I only started investigating it today, but either I am doing something wrong or I do not know how to use it properly. In my first test with 20k samples and 500 features, clustering on a single GPU was slower than on a CPU with 1 thread.

  4. Facebook FAISS - no support for sparse representation.
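
A quick back-of-the-envelope check of the dense-memory estimate in point 2 (float32 values; the ~100 non-zeros per row in the sparse case is purely a hypothetical figure for illustration):

    n_samples = 10_000_000
    n_features = 100_000

    # Dense float32 matrix: ~3.6 TiB, in line with the ~3 TB estimate above.
    dense_bytes = n_samples * n_features * 4
    print(dense_bytes / 2**40)                  # ~3.6

    # CSR with a hypothetical ~100 non-zeros per row:
    # float32 data + int32 column indices + int64 row pointers.
    nnz_per_row = 100
    csr_bytes = n_samples * nnz_per_row * (4 + 4) + (n_samples + 1) * 8
    print(csr_bytes / 2**30)                    # ~7.5 GiB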

PySpark MLlib KMeans is next on my list, but would it make sense on a single node?

Would training be faster for my use case on multiple GPUs, e.g., TensorFlow with 8 Tesla V100s?

Is there any magical library that I haven't heard of?

Or just simply scale vertically?

Snowball answered 11/10, 2019 at 18:13 Comment(4)
PySpark on 1 node would indeed not make sense; have a look at RAPIDS cuML. – Kaule
@Kaule thank you a lot. I think this is what I've been looking for. – Snowball
@desertnaut, do you have any experience with this library? After playing with it for a couple of days, it seems that the dataset must be converted to a cudf.DataFrame (a GPU object), right? I can't imagine how such a dataset would fit into GPU memory. – Snowball
Sorry, I don't have any experience with it yet. – Kaule

Thanks to @desertnaut for suggesting the RAPIDS cuML library.

The follow-up can be found here.
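
For anyone landing here, a minimal sketch of what the cuML route looks like (placeholder data and parameters; it assumes the matrix, or at least a sample of it, fits densely in GPU memory):

    import cupy as cp
    from cuml.cluster import KMeans

    # Placeholder data; in practice this would be the (sampled, densified) log matrix.
    X_gpu = cp.random.rand(100_000, 256, dtype=cp.float32)

    km = KMeans(n_clusters=50, max_iter=300, random_state=0)
    labels = km.fit_predict(X_gpu)     # cluster assignments, computed on the GPU
    centers = km.cluster_centers_      # learned centroids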

Snowball answered 15/10, 2019 at 7:37 Comment(3)
Have you compared the results (in quality and run time) to those obtained on a single CPU with the C++ implementations of Greg Hamerly (github.com/ghamerly/fast-kmeans) and a sample that easily fits into main memory? – Gipon
@HasQUIT--Anony-Mousse nope, I haven't seen that repository before. Hmm, strange, because I've searched the web a lot. But we ended up speeding up the clustering by using more threads. – Snowball
It supports Colab (rapids.ai). This library seems good; it saved me a lot of time. – Coloration
  1. Choose the algorithm wisely. There are clever algorithms, and there are stupid algorithms for k-means. Lloyd's is stupid, but it is the only one you will find in GPU implementations so far. It wastes a lot of resources on unnecessary computations, because GPU and "big data" people do not care about resource efficiency... Good algorithms include Elkan's, Hamerly's, Yinyang, Exponion, Annulus, etc. - these are much faster than Lloyd's.

    Sklearn is one of the better tools here, because it at least includes Elkan's algorithm. But if I am not mistaken, it may be making a dense copy of your data repeatedly, maybe in chunks so you don't notice it. When I compared k-means from sklearn with my own spherical k-means in Python, my implementation was many times faster. I can only explain this by my implementation using sparse optimizations while the sklearn version performed dense operations. But maybe this has been improved since.

  2. Implementation quality is important. There was an interesting paper about benchmarking k-means. Let me Google it:

    Kriegel, H. P., Schubert, E., & Zimek, A. (2017). The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowledge and Information Systems, 52(2), 341-378.

    They show how supposedly the same algorithm can have orders of magnitude runtime differences, depending on implementation differences. Spark does not fare very well there... It has too much overhead and too slow algorithms.

  3. You don't need all the data.

    K-means works with averages. The quality of the mean improves only very slowly as you add more data, so there is little benefit in using all the data you have. Just use a large enough sample and the results should be of almost the same quality. You can also exploit this for seeding: run on a smaller set first, then add more data for refinement (see the sketch after this list).

  4. Because your data is sparse, there is a high chance that k-means is not the right tool anyway. Have you tested the quality of your results? How do you ensure attributes are appropriately scaled? How much is the result determined simply by where the vectors are 0, and not by the actual non-zero values? Do results actually improve from rerunning k-means so often? What if you do not rerun k-means ever again? What if you just run it on a sample as discussed in 3)? What if you just pick k random centers and do 0 iterations of k-means? What is your best silhouette? Chances are that you cannot measure the difference and are just wasting time and resources for nothing. So what do you do to ensure the reliability of your results?
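
A minimal sketch of the sample-then-refine idea from point 3, assuming a scipy CSR matrix X and scikit-learn (sample size, k and iteration counts are placeholders; scikit-learn's Elkan path needs dense input, so this keeps the default solver for sparse data):

    import numpy as np
    from sklearn.cluster import KMeans

    def sample_then_refine(X, k, sample_size=500_000, random_state=0):
        # Seed on a random subsample, then run a short refinement pass over
        # the full data starting from the subsample's centers.
        rng = np.random.default_rng(random_state)
        idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
        seed = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[idx])
        return KMeans(n_clusters=k, init=seed.cluster_centers_, n_init=1,
                      max_iter=20, random_state=random_state).fit(X)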

Gipon answered 11/10, 2019 at 20:16 Comment(0)