How to compute distances between centroids and data matrix (for kmeans algorithm)

T

2

10

I am a student of clustering and R. In order to obtain a better grip of both I would like to compute the distance between centroids and my xy-matrix for each iteration till it "converges". How can I solve for step 2 and 3 using R?

library(fields)
x <- c(3,6,8,1,2,2,6,6,7,7,8,8)
y <- c(5,2,3,5,4,6,1,8,3,6,1,7)

df <- data.frame(x,y) initial matrix
a  <- c(3,6,8)
b  <- c(5,2,3)

df1 <- data.frame(a,b) # initial centroids

Here is what I want to do:

I0 <- t(rdist(df, df1)) after zero iteration
Cluster objects based on minimum distance
Determining the centroids based on the cluster average
Repetition with I1

I tried the kmeans function. But for some reasons it produces those centroids which have to come up at the end. That is I defined the start of:

start   <- matrix(c(3,5,6,2,8,3), 3, byrow = TRUE)
cluster <- kmeans(df,centers = start, iter.max = 1) # one iteration

kmeans doesn't allow me to track the movement of the centroids. Therefore I would like to do it "manually" by applying step 2 & 3 using R.

Tauten answered 22/11, 2014 at 20:49 Comment(4)

I suggest you narrow down the question to a specific section. – Pinery 23/11, 2014 at 8:25

I hope that is specific?! – Tauten 23/11, 2014 at 9:39

OK, now show us what you've tried and exactly where things go awry. – Pinery 23/11, 2014 at 9:40

I hope that is clear. Not sure whethet I have to specified it. – Tauten 23/11, 2014 at 9:55

D

27

Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers").

For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix to all the "centers".

Here is such a function:

myEuclid <- function(points1, points2) {
    distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
    for(i in 1:nrow(points2)) {
        distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
    }
    distanceMatrix
}

points1 is the data matrix with points as rows and dimensions as columns. points2 is the matrix of centers (points as rows again). The first line of code just defines the answer matrix (which will have as many rows as there are rows in the data matrix and as many columns as there are centers). So the point i,j in the result matrix will be the distance from the ith point to the jth center.

Then the for loop iterates over all centers. For each center it computes the euclidean distance from each point to the current center and returns the result. This line here: sqrt(rowSums(t(t(points1)-points2[i,])^2)) is euclidean distance. Inspect it closer and look up the formula if you have any troubles with that. (the transposes there are mainly done to make sure subtraction is being done row-wise).

Now you can also implement k-means algorithm:

myKmeans <- function(x, centers, distFun, nItter=10) {
    clusterHistory <- vector(nItter, mode="list")
    centerHistory <- vector(nItter, mode="list")

    for(i in 1:nItter) {
        distsToCenters <- distFun(x, centers)
        clusters <- apply(distsToCenters, 1, which.min)
        centers <- apply(x, 2, tapply, clusters, mean)
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}

As you can see it's also a very simple function - it takes data matrix, centers, your distance function (the one defined above) and number of wanted iterations.

The clusters are defined by assigning the closest center for each point. And centers are updated as a mean of the points assigned to that center. Which is a basic k-means algorithm).

Let's try it out. Define some random points (in 2d, so number of columns = 2)

mat <- matrix(rnorm(100), ncol=2)

Assign 5 random points from that matrix as initial centers:

centers <- mat[sample(nrow(mat), 5),]

Now run the algorithm:

theResult <- myKmeans(mat, centers, myEuclid, 10)

Here are the centers in the 10th iteration:

theResult$centers[[10]]
        [,1]        [,2]
1 -0.1343239  1.27925285
2 -0.8004432 -0.77838017
3  0.1956119 -0.19193849
4  0.3886721 -1.80298698
5  1.3640693 -0.04091114

Compare that with implemented kmeans function:

theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")

theResult2$centers
        [,1]        [,2]
1 -0.1343239  1.27925285
2 -0.8004432 -0.77838017
3  0.1956119 -0.19193849
4  0.3886721 -1.80298698
5  1.3640693 -0.04091114

Works fine. Our function however tracks the iterations. We can plot the progress over the first 4 iterations like this:

par(mfrow=c(2,2))
for(i in 1:4) {
    plot(mat, col=theResult$clusters[[i]], main=paste("itteration:", i), xlab="x", ylab="y")
    points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
}

Kmeans

Nice.

However this simple design allows for much more. For example if we want to use another kind of distance (not euclidean) we can just use any function that takes data and centers as inputs. Here is one for correlation distances:

myCor <- function(points1, points2) {
    return(1 - ((cor(t(points1), t(points2))+1)/2))
}

And we then can do Kmeans based on those:

theResult <- myKmeans(mat, centers, myCor, 10)

The resulting picture for 4 iterations then looks like this:

enter image description here

Even thou we specified 5 clusters - there were 2 left at the end. That is because for 2 dimensions the correlation can have to values - either +1 or -1. Then when looking for the clusters each point get's assigned to one center, even if it has the same distance to multiple centers - the first one get's chosen.

Anyway this is now getting out of scope. The bottom line is that there are many possible distance metrics and one simple function allows you to use any distance you want and track the results over iterations.

Didi answered 23/11, 2014 at 11:38 Comment(1)

You just nailed it. You exmplained me the whole concepts so nice I wish I could give you more credit. – Tauten 23/11, 2014 at 12:4

R

3

Modified the distance matrix function above (added another loop for no. of points) as the above function displays only the distance of first point from all clusters and not all points, which is what the question is looking for:

myEuclid <- function(points1, points2) {
    distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
    for(i in 1:nrow(points2)) {
        for (j in c(1:dim(t(points1))[2])) {
            
        distanceMatrix[j,i] <- sqrt(rowSums(t(t(points1)[,j]-t(points2[i,]))^2))
            }
    }
    distanceMatrix
}

Do let me know if this works fine!

Runny answered 5/6, 2020 at 11:51 Comment(0)

Recommended topics

Hot tags