Changes of clustering results after each time run in Python scikit-learn
I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and got the results with no problem. But every time I run it I get different results. I know this is a problem with the initialization but I don't know how to fix it. This is the part of my code that runs on the sentences:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import cluster

vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True, tokenizer=tokenize,
                             stop_words='english', charset_error="ignore", ngram_range=(1, 5), min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k, eigen_solver='arpack',
                                      affinity="nearest_neighbors", assign_labels="discretize")
spectral.fit(X)

data is a list of sentences. Every time the code runs, my clustering results differ. How can I get consistent results using spectral clustering? I also have the same problem with KMeans. This is my code for KMeans:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', charset_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1, verbose=0)
km.fit(X_data)

I appreciate your help.

Islam answered 18/9, 2014 at 20:28 Comment(0)

When using k-means, you want to set the random_state parameter in KMeans (see the documentation). Set this to either an int or a RandomState instance.

km = KMeans(n_clusters=number_of_k, init='k-means++', 
            max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)

This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.

I'm not sure about the spectral clustering example though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." The OP's code doesn't seem to fall into either of those cases, though setting the parameter might still be worth a shot.
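
For what it's worth, here is a minimal sketch of the OP's spectral clustering call with the seed pinned (number_of_k and X are the names defined in the question); whether it removes all run-to-run variation depends on which of the cases quoted above actually applies:

from sklearn import cluster

# Same parameters as in the question, plus an explicit seed.
spectral = cluster.SpectralClustering(n_clusters=number_of_k, eigen_solver='arpack',
                                      affinity="nearest_neighbors", assign_labels="discretize",
                                      random_state=3425)
spectral.fit(X)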

Justice answered 18/9, 2014 at 20:30 Comment(7)
Thanks for the hint on KMeans. Does the random state setting really affect the results? For example, if I set random_state=2222, will it change the results much? I'll try it and see. Regarding the spectral clustering, I checked the documentation prior to posting this question but didn't find much about the initialization. They do have a random_state parameter though, so I will set it like the one in KMeans and see how it changes. Thanks again.Islam
@Islam I think it all depends on your data. I haven't used it extensively, but I get the impression that k-means is actually fairly sensitive to the starting value. Of course, that's part of why k-means++ was developed, to get more consistently good starting values, but it's still probably an issue worth considering. Another common strategy is to run it multiple times with different seeds and pick the best one.Justice
By default the implementation actually runs K-Means 10 times and uses the best resulting clustering. So yes, it does affect the output in all but the trivial cases.Doreendorelia
@AndreasMueller If I use n_init=10 and also specify the random_state, as in n_init=10, random_state=3425, does this make sense? n_init is the number of times the k-means algorithm will be run with different centroid seeds. Will the centroids change or not, given the fixed random_state?Terrapin
The random state is set at the beginning, not for each initialization, for the obvious reasons...Cussed
I had similar problems when running the code on different computers. You may want to pass a random generator instance as random_state to avoid that, e.g. random_state=np.random.RandomState(12345).Dead
@Terrapin see here.Zusman

As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.

The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.

In my opinion, when the results vary strongly from run to run, this indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited for k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may not be well separated; and other algorithms may yield better results.
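
If you want to check this, here is a minimal sketch (on synthetic blob data rather than the OP's sentences) that measures how much the labelings of several differently seeded runs agree:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy data standing in for the TF-IDF matrix from the question.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Cluster with several different seeds.
labelings = [KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
             for seed in range(5)]

# Agreement with the first run (adjusted Rand index ignores label permutations);
# values near 1.0 indicate a stable clustering.
scores = [adjusted_rand_score(labelings[0], labels) for labels in labelings[1:]]
print("mean agreement with first run:", np.mean(scores))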

Cussed answered 3/10, 2014 at 18:17 Comment(2)
If I use n_init=10 and also specify the random_state, as in n_init=10, random_state=0, does this make sense? n_init is the number of times the k-means algorithm will be run with different centroid seeds. Will the centroids change or not, given the fixed random_state?Terrapin
@Terrapin see here.Zusman

I had a similar issue, except that I wanted a data set from another distribution to be clustered the same way as the original data set. For example, all color images of the original data set were in cluster 0 and all gray images of the original data set were in cluster 1. For another data set, I want color images / gray images to end up in cluster 0 and cluster 1 as well.

Here is the code I stole from a Kaggler: in addition to setting random_state to a seed, you reuse the fitted KMeans model to cluster the other data set. This works reasonably well. However, I can't find official scikit-learn documentation describing this.

# reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
import numpy as np
from sklearn.cluster import KMeans

seed = 42

def create_color_clusters(img_df, cluster_count=2, cluster_maker=None):
    # Fit a new model only if no fitted model was passed in; otherwise reuse
    # the existing one so new data is assigned to the same clusters.
    if cluster_maker is None:
        cluster_maker = KMeans(cluster_count, random_state=seed)
        cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])

    # transform() gives the distance to each centroid; the argmin is the
    # nearest cluster, i.e. the same assignment predict() would return.
    img_df['cluster-id'] = np.argmin(
        cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]), -1)
    return img_df, cluster_maker

# Now K-Means your images `img_df` into two clusters
img_df, cluster_maker = create_color_clusters(img_df, 2)
# Cluster another set of images using the same KMeans model
another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)

However, even setting random_state to an int seed cannot ensure the same data will always be grouped under the same labels across machines. The same data may be clustered as group 0 on one machine and as group 1 on another machine. But at least with the same KMeans model (cluster_maker in my code) we make sure data from another distribution will be clustered in the same way as the original data set.
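
As a side note, and assuming the same column names as in the snippet above, the argmin-over-transform trick is equivalent to calling predict on the fitted model, which returns the index of the nearest centroid for each row:

# Equivalent, more idiomatic assignment of new data to the already fitted clusters.
another_ids = cluster_maker.predict(another_img_df[['Green', 'Red-Green', 'Red-Green-Sd']])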

Cologarithm answered 17/3, 2018 at 17:6 Comment(0)

When running algorithms with many local minima, it is common to take a stochastic approach: run the algorithm many times with different initial states. This gives you multiple results, and the one with the lowest error is usually chosen as the best result.

When I use K-Means I always run it several times and use the best result.
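
For example, here is a minimal sketch of this strategy on synthetic data (scikit-learn's n_init parameter does essentially the same thing internally):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run k-means once per seed and keep the fit with the lowest inertia
# (within-cluster sum of squared distances).
best = min((KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X) for seed in range(10)),
           key=lambda km: km.inertia_)
print(best.inertia_)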

Stomach answered 25/9, 2014 at 2:37 Comment(0)

After a long while of searching and reading, here are my opinions:

  1. Use a number greater than 5 or 10 for n_init= so that the best result among those initializations is kept.
  2. Different numbers in random_state= will give different clustering results; in my experience, even when the data is well-distributed and good, the results may still deviate slightly.
  3. Trying many runs with a large n_init and a fixed random_state can produce consistent as well as good results; n_init seems to matter more than the seed.
  4. High-dimensional data may need a Principal Component Analysis (PCA) step prior to KMeans (link); see the sketch below.
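
A minimal sketch of point 4, on synthetic high-dimensional data (the feature counts and parameters are illustrative, not from the question):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X, _ = make_blobs(n_samples=500, n_features=50, centers=5, random_state=0)

# Reduce dimensionality first, then cluster; both steps are seeded for repeatability.
pipe = make_pipeline(PCA(n_components=10, random_state=0),
                     KMeans(n_clusters=5, n_init=10, random_state=0))
labels = pipe.fit_predict(X)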

Summarized answer: changes in clustering results from run to run are natural and nothing to worry about. You just need to save the results of each run.

Dynast answered 14/11, 2023 at 8:54 Comment(0)