How to speed up k-means from scikit-learn?

In my project I use k-means from scikit-learn to group data into clusters, but the computation is very slow and I need to speed it up.

I have tried setting n_jobs to -1, but it is still very slow!

Any suggestions on how to speed it up?

Differ answered 1/10, 2017 at 18:23 Comment(4)
What sort of data are you working with? You need to provide more details; there's no magic bullet, and I doubt the issue is scikit-learn's implementation rather than the fundamental inefficiency of the k-means algorithm.Radiochemistry
About 3000 data points in a 17-dimensional space, with k=400Differ
yes, well, the algorithm is O(n^(dk+1)), where n is the number of observations, d is the dimensionality, and k is the number of clusters
You should consider whether it really makes sense to put 3000 points into 400 clusters. That's only 7.5 points per cluster on average. You may want a much smaller k.Cruzcruzado

The main option in scikit-learn is to switch to mini-batch k-means (MiniBatchKMeans), which greatly reduces the computational cost. To some extent it is analogous to SGD (Stochastic Gradient Descent) vs. GD (Gradient Descent) for optimising non-linear functions: SGD is usually faster in terms of the computational cycles needed to converge to a local solution. Note that mini-batching introduces more variance into the optimisation, so results may be harder to reproduce (the optimisation will end up in different solutions more often than "full batch" k-means).
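
For illustration, a minimal sketch of what that switch might look like; the synthetic data and the batch_size / n_init values here are illustrative assumptions, not tuned recommendations:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the OP's data: ~3000 points in 17 dimensions.
X, _ = make_blobs(n_samples=3000, n_features=17, centers=50, random_state=0)

# MiniBatchKMeans updates the centroids from small random batches instead of
# the full dataset at every step, which is usually much cheaper per iteration.
mbk = MiniBatchKMeans(n_clusters=400, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
```

The interface mirrors KMeans, so it is essentially a drop-in replacement for comparing runtime and inertia on your own data.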

Secundine answered 1/10, 2017 at 18:52 Comment(5)
@Differ You can find a summary of mini-batch k-means in this paper. I'm not sure, but you may need to have the mini-batch size be greater (or significantly greater) than k for this to work.Cruzcruzado
Wojciech, what is the rationale for this trade-off: some achievable relative speedups, but, citing, "at the cost of lower cluster quality" and "the initialization strategy has less stabilization effect on the solutions, because its computation is done on a random sample instead of using the whole dataset", which opens an explicit and unhandled risk of getting stuck in a local, not global, optimum on real (non-synthetic) problem-domain datasets?Silviasilviculture
K-means always converges to a local optimum, whether one uses the whole dataset or mini-batches; fixed initialisation schemes lead to reproducible optimisation towards a local optimum, not a global one. Of course any stochasticity in the process carries a risk, so empirical analysis is the only thing that can tell how well it works on real problems; the paper cited by Jeremy shows a 0-4% decrease in the final k-means criterion value.Secundine
What is the quantitatively expressed expected relative reduction in [TIME] and [SPACE] (CPU cycles and memory footprint) once the k-means process switches into the proposed mini-batch mode? Is it fair to expect, and common to achieve, speedups in the range of { 1.01x | 1.1x | 2x | 3x | 5x | 10x | even faster } relative to using the whole dataset with classical k-means?Silviasilviculture
For some basic analysis on document clustering, please refer to the original work: eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf. Feel free to conduct these investigations and verify them on representative datasets of the real problems you care about; all that is stated in the answer is that this is the only faster tool for the OP's problem within the library the OP chose, and therefore a valid thing to try on the dataset of interest; of course there are dozens of other approximate solutions that could be tested too.Secundine

scikit-learn 0.23+ ships an optimized KMeans implementation with a new way to parallelize work across CPU cores:

https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23/
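
As a rough sketch of how that plays out in practice (assuming scikit-learn >= 0.23, where KMeans parallelizes over data chunks with OpenMP and the old n_jobs parameter is deprecated; the thread limit below is only an example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits  # installed as a scikit-learn dependency

# Synthetic stand-in for the OP's data: ~3000 points in 17 dimensions.
X, _ = make_blobs(n_samples=3000, n_features=17, centers=50, random_state=0)

# KMeans in 0.23+ uses all available cores via OpenMP by default; the number of
# threads can be capped with threadpoolctl (or the OMP_NUM_THREADS env variable).
with threadpool_limits(limits=4):
    km = KMeans(n_clusters=400, n_init=10, random_state=0).fit(X)
```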

Rebarebah answered 29/5, 2020 at 7:23 Comment(0)
