Is my python implementation of the Davies-Bouldin Index correct?

Asked 30/12, 2017 at 18:8 Answered 13/8, 2022 at 7:35

Solved python statistics cluster-analysis metrics data-science

I'm trying to calculate the Davies-Bouldin Index in Python.

Here are the steps the code below tries to reproduce.

5 Steps:

For each cluster, compute euclidean distances between each point to the centroid
For each cluster, compute the average of these distances
For each pair of clusters, compute the euclidean distance between their centroids

Then,

For each pair of clusters, make the sum of the average distances to their respective centroid (computed at step 2) and divide it by the distance separating them (computed at step 3).

Finally,

Compute the mean of all these divisions (= all indexes) to get the Davies-Bouldin index of the whole clustering

Code

def daviesbouldin(X, labels, centroids):

    import numpy as np
    from scipy.spatial.distance import pdist, euclidean
    
    nbre_of_clusters = len(centroids) #Get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
    distances_means = [] #Store the mean of these distances
    DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
    second_cluster_idx = [] #Store index of the second cluster of each pair
    first_cluster_idx = 0 #Set index of first cluster of each pair to 0
    
    # Step 1: Compute euclidean distances between each point of a cluster to their centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
    
    # Step 2: Compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))
  
    # Step 3: Compute euclidean distances between each pair of centroid
    ctrds_distance = pdist(centroids) 
     
    # Tricky step 4: Compute Davies-Bouldin index of each pair of cluster   
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i-1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
     
    # Step 5: Compute the mean of all DB_indexes   
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))

In arguments:

X is the data
labels, are the labels computed by a clustering algorithm (i.e: kmeans)
centroids are the coordinates of each cluster's centroid (i.e: cluster_centers_)

Also, note that I'm using Python 3

QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?

QUESTION2: Is my implementation of step 4 correct?

QUESTION3: Do I need to normalise intra and inter cluster distances ?

Further explanations on Step 4

Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.

At the first iteration:

sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 2 (index 1 of distances_means)
divides this sum by the distance between the 2 clusters (index 0 of ctrds_distance)

At the second iteration:

sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 3 (index 2 of distances_means)
divides this sum by the distance between the 2 clusters (index 1 of ctrds_distance)

and so on...

With the example of 10 clusters, the full iteration process should look like this:

intra-cluster distance intra-cluster distance       distance between their
      of cluster:             of cluster:           centroids(storage num):
         0           +             1            /             0
         0           +             2            /             1
         0           +             3            /             2
         0           +             4            /             3
         0           +             5            /             4
         0           +             6            /             5
         0           +             7            /             6
         0           +             8            /             7
         0           +             9            /             8
         1           +             2            /             9
         1           +             3            /             10
         1           +             4            /             11
         1           +             5            /             12
         1           +             6            /             13
         1           +             7            /             14
         1           +             8            /             15
         1           +             9            /             16
         2           +             3            /             17
         2           +             4            /             18
         2           +             5            /             19
         2           +             6            /             20
         2           +             7            /             21
         2           +             8            /             22
         2           +             9            /             23
         3           +             4            /             24
         3           +             5            /             25
         3           +             6            /             26
         3           +             7            /             27
         3           +             8            /             28
         3           +             9            /             29
         4           +             5            /             30
         4           +             6            /             31
         4           +             7            /             32
         4           +             8            /             33
         4           +             9            /             34
         5           +             6            /             35
         5           +             7            /             36
         5           +             8            /             37
         5           +             9            /             38
         6           +             7            /             39
         6           +             8            /             40
         6           +             9            /             41
         7           +             8            /             42
         7           +             9            /             43
         8           +             9            /             44

The problem here is I'm not quite sure that the index of distances_means matches the index of ctrds_distance.

In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.

In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.

Trophozoite answered 30/12, 2017 at 18:8 Comment(5)

Belongs to codereview.SE not stackoverflow. – Tripetalous 30/12, 2017 at 22:30

@Anony-Mousse Thanks fot the suggestion. I'm going to ask about this implementation right away. – Trophozoite 30/12, 2017 at 22:44

@Anony-Mousse No, it does not. To be on-topic at Code Review, the code should work to the best of the OP's knowledge. Since OP clearly has no idea, it's not fit for review. It's cross-posted there now and will likely be closed in the near future. – Filamentous 30/12, 2017 at 22:58

What happens if you have noise labels (-1) ? – Wessex 9/7, 2018 at 13:23

One way to check correctness is trying it out on a few examples and compare with expected results. Is the code able to reproduce that? – Bayard 13/8, 2022 at 7:40

Here is a shorter, faster corrected version of the Davies-Bouldin index naive implementation above.

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

    return(np.max(db) / n_cluster)

Answering my own questions:

the counter on the first draft (step 4) was correct BUT irrelevant
there's no need to normalise intra and inter cluster distances
there was a mistake when calculating Euclidean distances

Note you can find innovative approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces Euclidean distance by Cylindrical distance.

Trophozoite answered 10/1, 2018 at 14:7 Comment(6)

here, euclidean is nothing but euclidean_distances from sklearn.metric.paiwise, right ? – Gunny 28/2, 2018 at 11:46

looking back at the code it seems i was using scipy. from scipy.spatial.distance import pdist, euclidean – Trophozoite 1/3, 2018 at 12:43

cluster_k = [X[labels == k] for k in range(n_cluster)] TypeError: only integer arrays with one element can be converted to an index I am getting this error – Mercator 13/3, 2018 at 7:9

I tried this implementation, however what happens when you have noise labels (e.g. -1)?... – Wessex 9/7, 2018 at 13:22

This appears to make a mistake in getting the max across the entire centroid distance matrix, rather than the max distance for each cluster. – Coop 19/11, 2018 at 6:27

excellent work! I've voted up your answer and question. would you please provide a link to the dataset you use? – Obvert 21/6, 2019 at 2:56

thank you for your implementation. I just have one question: Is there missing a division in the last row. In the last step the value of max(db) should be divided by the implemented number of clusters.

def DaviesBouldin(Daten, DatenLabels):
n_cluster = len(np.bincount(DatenLabels)) 
cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)] 
centroids = [np.mean(k, axis = 0) for k in cluster_k] 
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)] # mittlere Entfernung zum jeweiligen Clusterzentrum
db = []

for i in range(n_cluster):
    for j in range(n_cluster):
        if j != i:
            db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
return(np.max(db))

Maybe I oversee that division because I'm new to Python. But in my graphics (I'm iterating over a range of clusters) the value of DB.max is very low at the beginning and increases afterwards. After the Scaling by the number of clusters the graph looks better (high DB.max value at the beginning and constantly falling with increasing number of clusters).

Best regards

Fraught answered 25/1, 2018 at 16:47 Comment(2)

I think you're right, the index should indeed decrease as the within-cluster distortion decreases. Thank you for spotting that oversight, the code is now updated. (btw If you're into clustering evaluation you might take a look at the Gap Statistic method which I find very efficient overall.) – Trophozoite 26/1, 2018 at 0:15

cluster_k = [X[labels == k] for k in range(n_cluster)] TypeError: only integer arrays with one element can be converted to an index I am getting this error I am getting this error – Mercator 13/3, 2018 at 7:10

Thanks for the code and revision - really helped me to get started. The shorter, faster version is not entirely correct. I amended it to correctly average the dispersion scores of the most similar cluster for each cluster.

See https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure for original algorithm and explanation:

The DBI is the average of the similarity measures of each cluster with its most similar cluster.

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]

    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []

    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
         # select Ri value of most similar cluster
         Ri.append(max(Rij)) 

    # get mean of all Ri values    
    dbi = np.mean(Ri)

    return dbi

Blindly answered 10/6, 2018 at 10:37 Comment(0)

DB Index is implemented in sk-learn starting from version 0.20. Have checked the results of implemented function and code written by me. Got the same result:

def DaviesBouldin(X, labels):
"""
:param X: array of data
:param labels: specific cluster for variable
:return: DB Index devided by number of clusters
"""
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
db = []
for i in range(n_cluster):
    dbij = []
    for j in range(n_cluster):
        if j != i:
            dbij.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
    db.append(max(dbij))

dbi = np.mean(db)

return dbi

Decry answered 13/8, 2022 at 7:35 Comment(0)

This one is 20x faster than the below code, all computation is done in numpy.

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n,1))
    dbi_arr = np.empty((n,n))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
    # divide 
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi

Here's an implementation using numpy 1.14, scipy 1.1.0, and python 3. Not much computational speed improvement but should have a slightly smaller memory footprint.

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
    Vector for each example.

    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
    True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # loop over non-duplicate cluster pairs
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j], 
                                euclidean(centroid_arr[i,...], centroid_arr[j,...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi/n

Temporal answered 18/9, 2018 at 22:34 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags