Is my python implementation of the Davies-Bouldin Index correct?
Asked Answered
T

5

5

I'm trying to calculate the Davies-Bouldin Index in Python.

Here are the steps the code below tries to reproduce.

5 Steps:

  1. For each cluster, compute euclidean distances between each point to the centroid
  2. For each cluster, compute the average of these distances
  3. For each pair of clusters, compute the euclidean distance between their centroids

Then,

  1. For each pair of clusters, make the sum of the average distances to their respective centroid (computed at step 2) and divide it by the distance separating them (computed at step 3).

Finally,

  1. Compute the mean of all these divisions (= all indexes) to get the Davies-Bouldin index of the whole clustering

Code

def daviesbouldin(X, labels, centroids):

    import numpy as np
    from scipy.spatial.distance import pdist, euclidean
    
    nbre_of_clusters = len(centroids) #Get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
    distances_means = [] #Store the mean of these distances
    DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
    second_cluster_idx = [] #Store index of the second cluster of each pair
    first_cluster_idx = 0 #Set index of first cluster of each pair to 0
    
    # Step 1: Compute euclidean distances between each point of a cluster to their centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
    
    # Step 2: Compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))
  
    # Step 3: Compute euclidean distances between each pair of centroid
    ctrds_distance = pdist(centroids) 
     
    # Tricky step 4: Compute Davies-Bouldin index of each pair of cluster   
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i-1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
     
    # Step 5: Compute the mean of all DB_indexes   
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes)) 

In arguments:

  • X is the data
  • labels, are the labels computed by a clustering algorithm (i.e: kmeans)
  • centroids are the coordinates of each cluster's centroid (i.e: cluster_centers_)

Also, note that I'm using Python 3

QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?

QUESTION2: Is my implementation of step 4 correct?

QUESTION3: Do I need to normalise intra and inter cluster distances ?


Further explanations on Step 4

Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.

At the first iteration:

  • sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 2 (index 1 of distances_means)
  • divides this sum by the distance between the 2 clusters (index 0 of ctrds_distance)

At the second iteration:

  • sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 3 (index 2 of distances_means)
  • divides this sum by the distance between the 2 clusters (index 1 of ctrds_distance)

and so on...

With the example of 10 clusters, the full iteration process should look like this:

intra-cluster distance intra-cluster distance       distance between their
      of cluster:             of cluster:           centroids(storage num):
         0           +             1            /             0
         0           +             2            /             1
         0           +             3            /             2
         0           +             4            /             3
         0           +             5            /             4
         0           +             6            /             5
         0           +             7            /             6
         0           +             8            /             7
         0           +             9            /             8
         1           +             2            /             9
         1           +             3            /             10
         1           +             4            /             11
         1           +             5            /             12
         1           +             6            /             13
         1           +             7            /             14
         1           +             8            /             15
         1           +             9            /             16
         2           +             3            /             17
         2           +             4            /             18
         2           +             5            /             19
         2           +             6            /             20
         2           +             7            /             21
         2           +             8            /             22
         2           +             9            /             23
         3           +             4            /             24
         3           +             5            /             25
         3           +             6            /             26
         3           +             7            /             27
         3           +             8            /             28
         3           +             9            /             29
         4           +             5            /             30
         4           +             6            /             31
         4           +             7            /             32
         4           +             8            /             33
         4           +             9            /             34
         5           +             6            /             35
         5           +             7            /             36
         5           +             8            /             37
         5           +             9            /             38
         6           +             7            /             39
         6           +             8            /             40
         6           +             9            /             41
         7           +             8            /             42
         7           +             9            /             43
         8           +             9            /             44

The problem here is I'm not quite sure that the index of distances_means matches the index of ctrds_distance.

In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.

In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.

Trophozoite answered 30/12, 2017 at 18:8 Comment(5)
Belongs to codereview.SE not stackoverflow.Tripetalous
@Anony-Mousse Thanks fot the suggestion. I'm going to ask about this implementation right away.Trophozoite
@Anony-Mousse No, it does not. To be on-topic at Code Review, the code should work to the best of the OP's knowledge. Since OP clearly has no idea, it's not fit for review. It's cross-posted there now and will likely be closed in the near future.Filamentous
What happens if you have noise labels (-1) ?Wessex
One way to check correctness is trying it out on a few examples and compare with expected results. Is the code able to reproduce that?Bayard
T
2

Here is a shorter, faster corrected version of the Davies-Bouldin index naive implementation above.

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

    return(np.max(db) / n_cluster)

Answering my own questions:

  • the counter on the first draft (step 4) was correct BUT irrelevant
  • there's no need to normalise intra and inter cluster distances
  • there was a mistake when calculating Euclidean distances

Note you can find innovative approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces Euclidean distance by Cylindrical distance.

Trophozoite answered 10/1, 2018 at 14:7 Comment(6)
here, euclidean is nothing but euclidean_distances from sklearn.metric.paiwise, right ?Gunny
looking back at the code it seems i was using scipy. from scipy.spatial.distance import pdist, euclideanTrophozoite
cluster_k = [X[labels == k] for k in range(n_cluster)] TypeError: only integer arrays with one element can be converted to an index I am getting this errorMercator
I tried this implementation, however what happens when you have noise labels (e.g. -1)?...Wessex
This appears to make a mistake in getting the max across the entire centroid distance matrix, rather than the max distance for each cluster.Coop
excellent work! I've voted up your answer and question. would you please provide a link to the dataset you use?Obvert
F
1

thank you for your implementation. I just have one question: Is there missing a division in the last row. In the last step the value of max(db) should be divided by the implemented number of clusters.

def DaviesBouldin(Daten, DatenLabels):
n_cluster = len(np.bincount(DatenLabels)) 
cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)] 
centroids = [np.mean(k, axis = 0) for k in cluster_k] 
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)] # mittlere Entfernung zum jeweiligen Clusterzentrum
db = []

for i in range(n_cluster):
    for j in range(n_cluster):
        if j != i:
            db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
return(np.max(db))

Maybe I oversee that division because I'm new to Python. But in my graphics (I'm iterating over a range of clusters) the value of DB.max is very low at the beginning and increases afterwards. After the Scaling by the number of clusters the graph looks better (high DB.max value at the beginning and constantly falling with increasing number of clusters).

Best regards

Fraught answered 25/1, 2018 at 16:47 Comment(2)
I think you're right, the index should indeed decrease as the within-cluster distortion decreases. Thank you for spotting that oversight, the code is now updated. (btw If you're into clustering evaluation you might take a look at the Gap Statistic method which I find very efficient overall.)Trophozoite
cluster_k = [X[labels == k] for k in range(n_cluster)] TypeError: only integer arrays with one element can be converted to an index I am getting this error I am getting this errorMercator
B
1

Thanks for the code and revision - really helped me to get started. The shorter, faster version is not entirely correct. I amended it to correctly average the dispersion scores of the most similar cluster for each cluster.

See https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure for original algorithm and explanation:

The DBI is the average of the similarity measures of each cluster with its most similar cluster.

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]

    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []

    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
         # select Ri value of most similar cluster
         Ri.append(max(Rij)) 

    # get mean of all Ri values    
    dbi = np.mean(Ri)

    return dbi
Blindly answered 10/6, 2018 at 10:37 Comment(0)
D
1

DB Index is implemented in sk-learn starting from version 0.20. Have checked the results of implemented function and code written by me. Got the same result:

def DaviesBouldin(X, labels):
"""
:param X: array of data
:param labels: specific cluster for variable
:return: DB Index devided by number of clusters
"""
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
db = []
for i in range(n_cluster):
    dbij = []
    for j in range(n_cluster):
        if j != i:
            dbij.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
    db.append(max(dbij))

dbi = np.mean(db)

return dbi
Decry answered 13/8, 2022 at 7:35 Comment(0)
T
0

This one is 20x faster than the below code, all computation is done in numpy.

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n,1))
    dbi_arr = np.empty((n,n))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
    # divide 
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi


Here's an implementation using numpy 1.14, scipy 1.1.0, and python 3. Not much computational speed improvement but should have a slightly smaller memory footprint.

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
    Vector for each example.

    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
    True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # loop over non-duplicate cluster pairs
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j], 
                                euclidean(centroid_arr[i,...], centroid_arr[j,...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi/n
Temporal answered 18/9, 2018 at 22:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.