I'm trying to calculate the Davies-Bouldin Index in Python.
Here are the steps the code below tries to reproduce.
5 Steps:
- For each cluster, compute euclidean distances between each point to the centroid
- For each cluster, compute the average of these distances
- For each pair of clusters, compute the euclidean distance between their centroids
Then,
- For each pair of clusters, make the sum of the average distances to their respective centroid (computed at step 2) and divide it by the distance separating them (computed at step 3).
Finally,
- Compute the mean of all these divisions (= all indexes) to get the Davies-Bouldin index of the whole clustering
Code
def daviesbouldin(X, labels, centroids):
import numpy as np
from scipy.spatial.distance import pdist, euclidean
nbre_of_clusters = len(centroids) #Get the number of clusters
distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
distances_means = [] #Store the mean of these distances
DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
second_cluster_idx = [] #Store index of the second cluster of each pair
first_cluster_idx = 0 #Set index of first cluster of each pair to 0
# Step 1: Compute euclidean distances between each point of a cluster to their centroid
for cluster in range(nbre_of_clusters):
for point in range(X[labels == cluster].shape[0]):
distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
# Step 2: Compute the mean of these distances
for e in distances:
distances_means.append(np.mean(e))
# Step 3: Compute euclidean distances between each pair of centroid
ctrds_distance = pdist(centroids)
# Tricky step 4: Compute Davies-Bouldin index of each pair of cluster
for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
second_cluster_idx.append(e)
if second_cluster_idx[i-1] == nbre_of_clusters - 1:
first_cluster_idx += 1
DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
# Step 5: Compute the mean of all DB_indexes
print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))
In arguments:
X
is the datalabels
, are the labels computed by a clustering algorithm (i.e: kmeans)centroids
are the coordinates of each cluster's centroid (i.e:cluster_centers_
)
Also, note that I'm using Python 3
QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?
QUESTION2: Is my implementation of step 4 correct?
QUESTION3: Do I need to normalise intra and inter cluster distances ?
Further explanations on Step 4
Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.
At the first iteration:
- sums intra-distances mean of cluster 1 (index 0 of
distances_means
) and intra-distances mean of cluster 2 (index 1 ofdistances_means
) - divides this sum by the distance between the 2 clusters (index 0 of
ctrds_distance
)
At the second iteration:
- sums intra-distances mean of cluster 1 (index 0 of
distances_means
) and intra-distances mean of cluster 3 (index 2 ofdistances_means
) - divides this sum by the distance between the 2 clusters (index 1 of
ctrds_distance
)
and so on...
With the example of 10 clusters, the full iteration process should look like this:
intra-cluster distance intra-cluster distance distance between their
of cluster: of cluster: centroids(storage num):
0 + 1 / 0
0 + 2 / 1
0 + 3 / 2
0 + 4 / 3
0 + 5 / 4
0 + 6 / 5
0 + 7 / 6
0 + 8 / 7
0 + 9 / 8
1 + 2 / 9
1 + 3 / 10
1 + 4 / 11
1 + 5 / 12
1 + 6 / 13
1 + 7 / 14
1 + 8 / 15
1 + 9 / 16
2 + 3 / 17
2 + 4 / 18
2 + 5 / 19
2 + 6 / 20
2 + 7 / 21
2 + 8 / 22
2 + 9 / 23
3 + 4 / 24
3 + 5 / 25
3 + 6 / 26
3 + 7 / 27
3 + 8 / 28
3 + 9 / 29
4 + 5 / 30
4 + 6 / 31
4 + 7 / 32
4 + 8 / 33
4 + 9 / 34
5 + 6 / 35
5 + 7 / 36
5 + 8 / 37
5 + 9 / 38
6 + 7 / 39
6 + 8 / 40
6 + 9 / 41
7 + 8 / 42
7 + 9 / 43
8 + 9 / 44
The problem here is I'm not quite sure that the index of distances_means
matches the index of ctrds_distance
.
In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.
In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.