Is sklearn.cluster.KMeans sensitive to data point order?

As noted in the answer to this post about feature scaling, some (all?) implementations of KMeans are sensitive to the order of the data points. According to the sklearn.cluster.KMeans documentation, n_init only changes the initial positions of the centroids. This would mean that one must loop over a few shuffles of the data points to test whether this is a problem. My questions are as follows:

  1. Is the scikit-learn implementation sensitive to the ordering, as the post suggests?
  2. Does n_init take care of it for me?
  3. If I am to do it myself, should I take the best result based on minimum inertia, or take an average as suggested here?
  4. Is there a good rule for how many shuffle permutations are sufficient, based on the number of data points?

UPDATE: The question initially asked about feature (column) order, which is not an issue. This was a misinterpretation of the term "objects" in the linked post. It has been updated to ask about the order of the data points (rows).

Construct answered 2/12, 2017 at 5:12 Comment(0)

K-means is not sensitive to feature order.

The post you refer to talks about scale, not order.

If you look at the kmeans equations, it should be obvious that the order does not matter.
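For reference, the standard k-means objective can be written as follows. The notation is assumed here rather than taken from the linked posts: x_{ij} is feature j of data point i, and \mu_{cj} is feature j of centroid c.

```latex
J(\mu_1, \dots, \mu_k) \;=\; \sum_{i=1}^{n} \; \min_{c \in \{1, \dots, k\}} \; \sum_{j=1}^{d} \left( x_{ij} - \mu_{cj} \right)^2
```

Permuting the feature index j (in the data and the centroids alike) only reorders the terms of the inner sum, so the value of J is unchanged. The outer sum over i is likewise independent of row order, so any order dependence has to come from how a particular implementation picks its starting centroids, not from the objective itself.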

There has been research (von Luxburg, if I recall correctly) that essentially says that if there is a good kmeans result, then it must be easy to find. If you get very different results when running kmeans multiple times, then none of the results is good.

There are "n choose k" possible initializations when the initial centroids are picked from the data points. While they can't all be bad, n_init will only try very few of them, so there is no guarantee of finding the "best". The function will return the run with the lowest SSQ (sum of squared distances, which scikit-learn calls inertia), but that does not mean this is the most useful result in the end, unless you only care about SSQ.
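As a rough illustration of the "lowest SSQ wins" behaviour, here is a minimal sketch that runs KMeans with a single initialization per seed and keeps the best run by hand, which is in effect what n_init does for you. The synthetic data from make_blobs, the cluster count, and the number of seeds are all assumptions chosen for the example, not values from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: an assumption for illustration only.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=0)

# One random initialization per run, each with a different seed.
single_runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]
inertias = [km.inertia_ for km in single_runs]

# Keep the run with the lowest inertia, as n_init > 1 would do.
best = min(single_runs, key=lambda km: km.inertia_)

# inertia_ is the SSQ: squared distance of every point to its assigned centroid.
ssq = float(((X - best.cluster_centers_[best.labels_]) ** 2).sum())

print("inertia per run:", np.round(inertias, 2))
print("best inertia:", round(best.inertia_, 2), "| manual SSQ:", round(ssq, 2))
```

If the per-run inertias vary a lot, that is the instability described above; if they are nearly identical, extra initializations add little.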

Forwent answered 2/12, 2017 at 10:22 Comment(3)
I realize the question is about the scale, but the answer has a long footnote suggesting that the feature order matters. It is also stated in the answer to the other link. They use the terms "order of objects" and "order of observations", respectively. Do they mean the order of the data points, not the features? The equations would imply this is not the case, but the answerer said it had to do with the implementation. - Construct
The order of rows matters for kmeans, in particular for initialization. - Forwent
Thank you! Could you please update the answer to respond to my sub-questions, i.e., how this affects the scikit-learn implementation and whether n_init is sufficient to account for this issue? - Construct
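One way to check the row-order effect empirically is to refit on shuffled copies of the same data and compare inertias. This is only a sketch under assumptions: the well-separated synthetic blobs, the five permutations, and n_init=10 are illustrative choices, and a small spread on easy data like this says nothing about harder datasets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs: an assumption made so the example is easy.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

rng = np.random.default_rng(0)
inertias = []
for _ in range(5):
    # Same rows, different order.
    X_perm = X[rng.permutation(len(X))]
    # Same seed every time, so only the row order changes between fits.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_perm)
    inertias.append(km.inertia_)

spread = max(inertias) - min(inertias)
print("inertia per row order:", inertias)
print("spread:", spread)
```

A spread near zero suggests the row order did not change the solution found for this data. A large spread suggests the result depends on initialization, in which case taking the minimum-inertia fit over shuffles amounts to the same thing as raising n_init.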
