We've been using K-means to cluster our logs. A typical dataset has 10 million samples with 100k+ features.
To find the optimal k, we run multiple K-means fits in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. We are currently using scikit-learn's KMeans. For a dataset of this size, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB of RAM.
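For context, the current pipeline is essentially the loop below (a simplified sketch with toy sizes; the real data is a CSR matrix loaded from disk, and the silhouette score is computed on a subsample since the full pairwise computation is not feasible at 10M rows):

```python
from joblib import Parallel, delayed
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy sparse stand-in for the real log matrix (the real one is ~10M x 100k, CSR).
X = sparse_random(20_000, 1_000, density=0.01, format="csr", random_state=0)

def fit_and_score(k):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    # Silhouette on a subsample; the full O(n^2) computation is infeasible at 10M rows.
    return k, silhouette_score(X, labels, sample_size=5_000, random_state=0)

# One KMeans fit per candidate k, run in parallel across cores.
results = Parallel(n_jobs=-1)(delayed(fit_and_score)(k) for k in range(2, 21))
best_k, best_score = max(results, key=lambda r: r[1])
print(best_k, best_score)
```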
I've been researching faster alternatives.
What I have already tested:
K-means + Mean Shift combination - a little better (for k=1024 --> ~13h) but still slow; see the sketch after this list.
kmcuda library - no support for sparse matrix representation; it would require ~3 TB of RAM to hold that dataset as a dense matrix in memory.
TensorFlow (tf.contrib.factorization.python.ops.KMeansClustering()) - only started investigating today, but either I am doing something wrong or I don't know how to use it properly. In my first test with 20k samples and 500 features, clustering on a single GPU is slower than on a CPU with 1 thread. A rough sketch of that test also follows the list.
Facebook FAISS - no support for sparse representation.
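By the K-means + Mean Shift combination I mean, roughly, a two-stage pipeline like this (simplified sketch, toy sizes): over-segment the sparse data with a large k, then merge the dense centroids with Mean Shift.

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans, MeanShift

# Toy sparse stand-in for the real data.
X = sparse_random(20_000, 1_000, density=0.01, format="csr", random_state=0)

# Stage 1: over-segment the sparse data with a large k (k=1024 in the real runs).
coarse = KMeans(n_clusters=1024, random_state=0).fit(X)

# Stage 2: merge the (dense) centroids with Mean Shift to obtain the final clusters.
ms = MeanShift().fit(coarse.cluster_centers_)
print(len(ms.cluster_centers_), "final clusters")
```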
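And the TensorFlow test mentioned above is roughly the following (a sketch from memory; the exact contrib module path and estimator API differ between TF 1.x versions, so treat the details as approximate):

```python
import numpy as np
import tensorflow as tf

# Dense toy stand-in for the 20k x 500 test matrix (the contrib estimator wants dense tensors).
points = np.random.rand(20_000, 500).astype(np.float32)

def input_fn():
    # Feed the whole matrix once per train() call.
    return tf.train.limit_epochs(tf.convert_to_tensor(points), num_epochs=1)

kmeans = tf.contrib.factorization.KMeansClustering(
    num_clusters=50, use_mini_batch=False)

for _ in range(10):  # a handful of Lloyd iterations
    kmeans.train(input_fn)
print(kmeans.cluster_centers().shape)
```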
PySpark MLlib KMeans is next on my list, but would it make sense on a single node?
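If it does make sense, I would expect the single-node attempt to look something like this (a minimal sketch with toy sparse vectors; the "features" column name is the API default, and the local[32] master is just an assumption for a 32-core box):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[32]").appName("log-kmeans").getOrCreate()

# Toy sparse rows standing in for the real log features: Vectors.sparse(size, indices, values).
rows = [
    (Vectors.sparse(5, [0, 3], [1.0, 2.0]),),
    (Vectors.sparse(5, [1, 4], [3.0, 1.0]),),
    (Vectors.sparse(5, [0, 1], [0.5, 0.5]),),
]
df = spark.createDataFrame(rows, ["features"])

model = KMeans(k=2, seed=1).fit(df)   # DataFrame-based API; reads the "features" column
print(model.clusterCenters())
```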
Would training be faster for my use case on multiple GPUs, e.g., TensorFlow with 8x Tesla V100?
Is there any magical library that I haven't heard of?
Or should I just scale vertically?