Finding the center of a cluster

Asked 10/8, 2009 at 8:52 Answered 2/5 at 16:4

Solved algorithm cluster-analysis data-mining

I have the following problem - made abstract to bring out the key issues.

I have 10 points each which is some distance from the other. I want to

be able to find the center of the cluster i.e. the point for which the pairwise distance to each other point is minimised,
let p(j) ~ p(k) represent the pairwise distance beteen points j and k
p(i) is center-point of the cluster iff p(i) s.t. min[sum(p(j)~p(k))] for all 0 < j,k <= n where we have n points in the cluster
determine how to split the cluster in to two clusters once the number of data points in the cluster goes above some threshold t.

This is not euclidean space. But the distances can be summarised as follows - p(i) is point i:

       p(1)    p(2)    p(3)    p(4)    p(5)    p(6)    p(7)    p(8)    p(9)    p(10)
p(1)    0       2       1       3       2       3       3       2       3        4
p(2)    2       0       1       3       2       3       3       2       3        4
p(3)    1       1       0       2       0       1       2       1       2        3
p(4)    3       3       2       0       1       2       3       2       3        4      
p(5)    2       2       1       1       0       1       2       1       2        3   
p(6)    3       3       2       2       1       0       3       2       3        4   
p(7)    3       3       2       3       2       3       0       1       2        3  
p(8)    2       2       1       2       1       2       1       0       1        2 
p(9)    3       3       2       3       2       3       2       1       0        1
p(10)   4       4       3       4       3       4       3       2       1        0

How would I calculate which is the center point of this cluster?

Fini answered 10/8, 2009 at 8:52 Comment(5)

Please define "center of the cluster" – Mountaineering 10/8, 2009 at 9:0

@ Nifle - done ... do you have any ideas – Fini 10/8, 2009 at 9:4

The application has to do with clustering concepts - my application is a semantic data store - the points represent abstract objects. I want to cluster objects to be able to determine "concepts" – Fini 10/8, 2009 at 9:7

Do you mean: choose i that minimizes [sum(p(i)~p(j)] for 0 < j <= n, so that p(i) is the center? If so, I think that's your answer to part 1; otherwise I'm not sure what you mean. – Delirium 10/8, 2009 at 9:23

@Jim - in words, we want to minimise the pairwise distances for all pairs of points in a cluster. The point that gives the minimum is the center point. – Fini 10/8, 2009 at 17:34

As far as I understand this looks like K Means Clustering, and what you are looking for is usually known as 'Medoids'.

See here: http://en.wikipedia.org/wiki/Medoids or here: http://en.wikipedia.org/wiki/K-medoids

Granada answered 10/8, 2009 at 9:5 Comment(2)

Upvoted: this indeed looks like the way to go for non-Euclidean metrics. But it still requires at least that the triangle inequality holds; I confess I'm not patient enough to verify that for Ankur's example. – Delirium 10/8, 2009 at 9:38

@ Jim, the triangle inequality does hold in my "metric space", so then this should work. – Fini 10/8, 2009 at 17:35

I may be about to have that frisson that comes just before displaying utter stupidity. But doesn't this yield easily to brute force? In Python:

distances = [
[ 0 , 2 , 1 , 3 , 2 , 3 , 3 , 2 , 3 , 4 , ],
[ 2 , 0 , 1 , 3 , 2 , 3 , 3 , 2 , 3 , 4 , ],
[ 1 , 1 , 0 , 2 , 0 , 1 , 2 , 1 , 2 , 3 , ],
[ 3 , 3 , 2 , 0 , 1 , 2 , 3 , 2 , 3 , 4 , ],
[ 2 , 2 , 1 , 1 , 0 , 1 , 2 , 1 , 2 , 3 , ],
[ 3 , 3 , 2 , 2 , 1 , 0 , 3 , 2 , 3 , 4 , ],
[ 3 , 3 , 2 , 3 , 2 , 3 , 0 , 1 , 2 , 3 , ],
[ 2 , 2 , 1 , 2 , 1 , 2 , 1 , 0 , 1 , 2 , ],
[ 3 , 3 , 2 , 3 , 2 , 3 , 2 , 1 , 0 , 1 , ],
[ 4 , 4 , 3 , 4 , 3 , 4 , 3 , 2 , 1 , 0 , ],
]

currentMinimum = 99999

for point in range ( 10 ) :
    distance_sum = 0
    for second_point in range ( 10 ) :
        if point == second_point : continue
        distance_sum += distances [ point ] [ second_point ]
    print '>>>>>', point, distance_sum 

    if distance_sum < currentMinimum :
        currentMinimum = distance_sum 
        centre = point

print centre

Spermatozoon answered 11/8, 2009 at 14:2 Comment(1)

Hi Bill, I have come to a similar conclusion. Once you have your cluster and you want to chose the centre of a cluster, this is pretty much the way to do it. What I am doing is this: I start with a single cluster because I start with 1 data point, as more are added and my cluster becomes greater than some threshold t, I split it. I then choose the two furtherest points as centres for the new clusters. Each point is then allocated to one of the two points depending on which is closer. Then the actual centre in each cluster is computed. – Fini 12/8, 2009 at 7:53

find median or average values of all distances. = avgAll
For each p, find average distance to other machines. = avgP(i)
Pick the closer one as center. avgAll ~= avgP(i)

b) no idea for now..

maybe for each p, find the closer machine.

by this logic make a graph.

than somehow (i dont know yet) divide the graph

Brachiate answered 10/8, 2009 at 8:59 Comment(0)

What you're trying to do, or at least (b) belongs to Cluster Analysis. A branch of mathematics / statistics / econometrics where datapoints (e.g. points in n-dimensional space) are divided among groups or clusters. How to do this is not a trivial questions, there are many, many possible ways.

Cozza answered 10/8, 2009 at 14:23 Comment(0)

If you treat the points as a distribution, you can get the median/mean. Then, using the distance metric, you can get the the point/sample closest to the median/mean (it should have the least distance with the median/mean). This may give you more than 1 point/sample to be at the 'center'. Statistically, that may have a greater significance. Ultimately depends on what you are trying to do with it.

Fefeal answered 2/5 at 16:4 Comment(0)

Recommended topics

Hot tags