Python, Scikit-learn, K-means: What does the parameter n_init actually do? [duplicate]

1

1

I'm a beginner in Python, and I'm trying to understand what the parameter n_init of sklearn.cluster.KMeans does.

From the documentation:

n_init : int, default: 10

Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

At first, I thought it meant the number of iterations the algorithm would run, until I found this helpful question and realized that's what max_iter does.

What exactly does the parameter n_init do? I really don't understand it.

Lubricous answered 22/9, 2017 at 7:47 Comment(2)
Since the starting points are randomized, n_init states how many different sets of random starting points the algorithm should use. It then returns the best run in terms of inertia (the sum of squared distances from each sample to its closest centroid; lower is better). Berkie
It will initialize the centroids for the clusters randomly this many times. Depending on the initial values of the centroids, the clusters formed may be different. Scammon
6

In K-means, the initial placement of the centroids plays a very important role in its convergence. Sometimes the initial centroids are placed in such a way that the clusters keep changing drastically over consecutive iterations, and max_iter is reached before the convergence condition is met, or the algorithm settles into a poor local optimum. Either way, the clusters obtained from that run may not be good.

To overcome this problem, the n_init parameter was introduced. Its value determines how many different sets of initial centroids the algorithm tries. Each set gives one complete run of K-means, and the runs are compared by their inertia, i.e. the sum of squared distances from each sample to its closest centroid: the lower the inertia, the tighter the clusters. The run with the lowest inertia is kept, and its centroids and cluster labels are returned.
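As a rough illustration of the above, here is a minimal sketch that compares single-initialization runs with one n_init=10 fit. The synthetic data from make_blobs, the choice of 5 clusters, and the seeds are assumptions made only for this example; init="random" is used so that the effect of the starting centroids is easier to see.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 500 points around 5 blob centers (illustrative only).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.5, random_state=0)

# One run per seed: each run starts from a different random set of centroids
# and may end in a different local optimum with a different inertia.
single_run_inertias = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

# n_init=1 does a single initialization; n_init=10 does ten initializations
# inside one fit and keeps the run with the lowest inertia.
single = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)

# Inertia recomputed by hand: sum of squared distances from each sample
# to the centroid of the cluster it was assigned to.
manual_inertia = ((X - multi.cluster_centers_[multi.labels_]) ** 2).sum()

print("single-run inertias:", np.round(single_run_inertias, 1))
print("n_init=1 inertia   :", round(single.inertia_, 1))
print("n_init=10 inertia  :", round(multi.inertia_, 1))
print("recomputed inertia :", round(manual_inertia, 1))
```

If the single-run inertias differ from seed to seed, that spread is exactly what n_init is meant to guard against: the fit with n_init=10 reports the best of its ten runs rather than whatever one starting point happened to produce.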

If you are interested, you can also look at the k-means++ algorithm, which was designed specifically to deal with this problem.
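In scikit-learn the initialization scheme is chosen with the init parameter of KMeans, and "k-means++" is its default value. The sketch below compares it with init="random" using one initialization per seed; the dataset, cluster count, seeds, and the helper name inertias are assumptions for illustration, and the result on your own data may look different.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used only for this illustration.
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.5, random_state=0)


def inertias(init, seeds=range(10)):
    """Final inertia of one single-initialization run per seed (hypothetical helper)."""
    return [
        KMeans(n_clusters=5, init=init, n_init=1, random_state=s).fit(X).inertia_
        for s in seeds
    ]


random_inertias = inertias("random")
kpp_inertias = inertias("k-means++")

# A narrower min/max range means the result depends less on the starting point.
print("random    : min %.1f  max %.1f" % (min(random_inertias), max(random_inertias)))
print("k-means++ : min %.1f  max %.1f" % (min(kpp_inertias), max(kpp_inertias)))
```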

You can also look at this link for more details about why the initial centroids matter.

Tricorn answered 22/9, 2017 at 9:53 Comment(3)
If someone uses n_init=10 and random_state=1234, then the answer does not make sense. How can you randomly initialize the centroids 10 times with a fixed random_state? Softshoe
@serafeim it basically means selecting 10 * (no. of centroids) points at random, with the random state set to 1234. Does this help clear up your query? Tricorn
@Softshoe n_init determines the total number of runs (initializations), while random_state sets the seed of the random number generator. The generator is seeded once, before these runs begin, which makes sure the same 10 initializations are generated every time k-means is run with that seed. Microsurgery
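The point made in the comments above can be checked with a small sketch: a seeded generator still yields a sequence of different numbers, but the same sequence on every run, so two fits with n_init=10 and the same random_state should agree. The data, the cluster count, and the range passed to randint are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A fixed seed does not make every draw identical; it makes the whole
# sequence of draws repeatable.
draws1 = np.random.RandomState(1234).randint(0, 1000, size=10)
draws2 = np.random.RandomState(1234).randint(0, 1000, size=10)
print(draws1)                          # ten values from one seeded generator
print(np.array_equal(draws1, draws2))  # same seed, same sequence

# Synthetic data, used only for this illustration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Two independent fits, each trying 10 initializations from the same seed.
a = KMeans(n_clusters=4, init="random", n_init=10, random_state=1234).fit(X)
b = KMeans(n_clusters=4, init="random", n_init=10, random_state=1234).fit(X)

same_centers = np.allclose(a.cluster_centers_, b.cluster_centers_)
same_labels = np.array_equal(a.labels_, b.labels_)
print("same centers:", same_centers)
print("same labels :", same_labels)
```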
