How to choose initial centroids for k-means clustering

I am working on implementing k-means clustering in Python. What is a good way to choose initial centroids for a data set? For instance, I have the following data set:

A,1,1
B,2,1
C,4,4
D,4,5

I need to create two different clusters. How do I start with the centroids?

Berliner answered 12/3, 2016 at 0:15 Comment(0)

You might want to look into the k-means++ method, because it is one of the most popular and simplest ways of choosing initial centroids, and it gives consistent results. Here is the paper on it. It works as follows (a rough Python sketch follows the list):

  • Choose one center uniformly at random from among the data points.
  • For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  • Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
  • Repeat Steps 2 and 3 until k centers have been chosen.
  • Now that the initial centers have been chosen, proceed using standard k-means clustering.
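
Here is a minimal sketch of that initialization in Python, assuming NumPy and using the points from the question; the function name and the use of numpy.random (instead of scipy.stats.rv_discrete) are my own choices, not part of the answer:

import numpy as np

def kmeans_pp_init(points, k, seed=None):
    # k-means++: first center picked uniformly, later centers weighted by D(x)^2.
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = [points[rng.integers(len(points))]]              # step 1: uniform pick
    while len(centers) < k:
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)              # D(x)^2 to nearest chosen center
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])  # step 3: weighted pick
    return np.array(centers)

print(kmeans_pp_init([[1, 1], [2, 1], [4, 4], [4, 5]], k=2))
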
Pomology answered 12/3, 2016 at 0:33 Comment(2)
"Choose one center uniformly at random from among the data points": I didn't get this part. How do I choose it for my data set?Berliner
If you choose a random element, you sample it according to some distribution. Uniformly simply means that you sample it from a set where drawing each element is equally probable. So in your case you can do something like random.sample(set('ABCD'), 1).Pomology

The standard initialization is to simply

  • choose k random instances.

There are many more methods (such as k-means++), but they often don't yield consistently better results than this baseline. Methods such as k-means++ sometimes work well, but very often they bring no improvement while taking a lot of extra time to compute.
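
For instance, assuming NumPy and the question's points, this baseline is just a random row selection (a sketch, not part of the answer itself):

import numpy as np

data = np.array([[1, 1], [2, 1], [4, 4], [4, 5]])  # the question's points, labels dropped
k = 2
rng = np.random.default_rng()
# Pick k distinct instances at random and use them as the initial centroids.
initial_centroids = data[rng.choice(len(data), size=k, replace=False)]
print(initial_centroids)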

Stenography answered 13/3, 2016 at 23:24 Comment(0)

If a dataset is small, as it is in your case, k-means itself picks random distinct points as the initial clusters and then recalculates the centroids repeatedly to optimize the distance between the centroids and the points.

However, if a dataset is large, then instead of randomly initializing the clusters you can use a simple approach called sharding, which reduces the number of iterations required to optimize the clustering and thereby saves time.

You can apply sharding as it is explained in detail here:

Sharding in k-means
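
Naive sharding is usually described as: sort the instances by the sum of their attribute values, split them into k equal-size shards, and use each shard's column-wise mean as an initial centroid. A rough sketch under that assumption (the function name is mine, and the linked article may differ in details):

import numpy as np

def sharding_init(points, k):
    # Naive sharding: sort by row sum, split into k shards, average each shard.
    points = np.asarray(points, dtype=float)
    order = np.argsort(points.sum(axis=1))       # order instances by their composite (summed) value
    shards = np.array_split(points[order], k)    # k roughly equal-size shards
    return np.array([shard.mean(axis=0) for shard in shards])

print(sharding_init([[1, 1], [2, 1], [4, 4], [4, 5]], k=2))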

Surfacetosurface answered 28/2, 2018 at 14:9 Comment(0)

One standard initialization is to assign each data point to a cluster at random, and then just calculate the means of those random clusters.

Another is to just pick k random data points, where k is the number of clusters, and those are your means. This is sometimes called the Forgy method.
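
A quick sketch of the first (random-partition) variant, assuming NumPy and the question's points; the re-draw loop is my own guard against empty clusters, not part of the answer:

import numpy as np

data = np.array([[1, 1], [2, 1], [4, 4], [4, 5]])
k = 2
rng = np.random.default_rng()
# Random partition: give every point a random cluster label ...
labels = rng.integers(k, size=len(data))
while len(np.unique(labels)) < k:                # re-draw if some cluster got no points
    labels = rng.integers(k, size=len(data))
# ... then the initial centroids are the means of those random groups.
initial_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
print(initial_centroids)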

Magnetograph answered 12/3, 2016 at 0:31 Comment(1)
Random cluster assignment is actually one of the worst methods, because on average all the centers will end up nearly the same.Stenography
