how to plot a k-distance graph in python
Asked Answered
P

4

9

How do I plot (in python) the distance graph for a given value of min-points in DBSCAN???

I am looking for the knee and corresponding epsilon value.

In the sklearn I do not see any method that return such distances.... Am I missing something?

Porphyritic answered 1/4, 2017 at 18:0 Comment(0)
L
8

You probably want to use the matrix operations provided by numpy to speed up your distance matrix calculation.

def k_distances2(x, k):
    dim0 = x.shape[0]
    dim1 = x.shape[1]
    p=-2*x.dot(x.T)+np.sum(x**2, axis=1).T+ np.repeat(np.sum(x**2, axis=1),dim0,axis=0).reshape(dim0,dim0)
    p = np.sqrt(p)
    p.sort(axis=1)
    p=p[:,:k]
    pm= p.flatten()
    pm= np.sort(pm)
    return p, pm
m, m2= k_distances2(X, 2)
plt.plot(m2)
plt.ylabel("k-distances")
plt.grid(True)
plt.show()
Lithuanian answered 29/5, 2018 at 2:2 Comment(1)
Why dim1 is never used? Do you want to use reshape(dim0, dim1) instead of reshape(dim0, dim0)?Pertussis
W
9

First you can define a function to calculate the distance of each point to its k-th nearest neighbor:

def calculate_kn_distance(X,k):

    kn_distance = []
    for i in range(len(X)):
        eucl_dist = []
        for j in range(len(X)):
            eucl_dist.append(
                math.sqrt(
                    ((X[i,0] - X[j,0]) ** 2) +
                    ((X[i,1] - X[j,1]) ** 2)))

        eucl_dist.sort()
        kn_distance.append(eucl_dist[k])

    return kn_distance

Then, once you have defined your function, you can choose a k value and plot the histogram to find a knee to define an appropriate epsilon value.

eps_dist = calculate_kn_distance(X[1],4)
plt.hist(eps_dist,bins=30)
plt.ylabel('n');
plt.xlabel('Epsilon distance');

enter image description here

In the example above, a vast majority of points lie within 0.12 units from their 4th nearest neighbor. So, a heuristic approach, could be choosing 0.12 as epsilon parameter.

Weariful answered 11/4, 2019 at 8:56 Comment(0)
L
8

You probably want to use the matrix operations provided by numpy to speed up your distance matrix calculation.

def k_distances2(x, k):
    dim0 = x.shape[0]
    dim1 = x.shape[1]
    p=-2*x.dot(x.T)+np.sum(x**2, axis=1).T+ np.repeat(np.sum(x**2, axis=1),dim0,axis=0).reshape(dim0,dim0)
    p = np.sqrt(p)
    p.sort(axis=1)
    p=p[:,:k]
    pm= p.flatten()
    pm= np.sort(pm)
    return p, pm
m, m2= k_distances2(X, 2)
plt.plot(m2)
plt.ylabel("k-distances")
plt.grid(True)
plt.show()
Lithuanian answered 29/5, 2018 at 2:2 Comment(1)
Why dim1 is never used? Do you want to use reshape(dim0, dim1) instead of reshape(dim0, dim0)?Pertussis
S
4

To get distances you could use this function:

import numpy as np
import pandas as pd
import math

def k_distances(X, n=None, dist_func=None):
    """Function to return array of k_distances.

    X - DataFrame matrix with observations
    n - number of neighbors that are included in returned distances (default number of attributes + 1)
    dist_func - function to count distance between observations in X (default euclidean function)
    """
    if type(X) is pd.DataFrame:
        X = X.values
    k=0
    if n == None:
        k=X.shape[1]+2
    else:
        k=n+1

    if dist_func == None:
        # euclidean distance square root of sum of squares of differences between attributes
        dist_func = lambda x, y: math.sqrt(
            np.sum(
                np.power(x-y, np.repeat(2,x.size))
            )
        )

    Distances = pd.DataFrame({
        "i": [i//10 for i in range(0, len(X)*len(X))],
        "j": [i%10 for i in range(0, len(X)*len(X))],
        "d": [dist_func(x,y) for x in X for y in X]
    })
    return np.sort([g[1].iloc[k].d for g in iter(Distances.groupby(by="i"))])

X should be pandas.DataFrame or numpy.ndarray. n is number of neighbors that are in d-neighborhood. You should know this number. By default is number of attributes + 1.

To plot those distances you could use this code:

import matplotlib.pyplot as plt

d = k_distances(X,n,dist_func)
plt.plot(d)
plt.ylabel("k-distances")
plt.grid(True)
plt.show()
Shep answered 21/3, 2018 at 12:16 Comment(0)
C
4

I'll try my best to do an extensive guide for future viewers
In a nutshell the steps are (using distance matrix)

  1. Get the sorted distance matrix
  2. Get the kth column (kth column represents the distances with kth neighbour)
  3. Sort the kth column in descending order
  4. Plot it in y-axis and (0-n) in x-axis

Lets take a simple dataset with n = 7

   x,   y
1  1,   1
2  1.5, 1.5
3  1.25,1.25
4  1.5, 1
5  1,   1.5
6  1.75,1.75
7  3,   2

enter image description here

The sorted distance matrix
[[0, 0.353, 0.5,   0.5,   0.707, 1.061, 2.236]
 [0, 0.353, 0.353, 0.5,   0.5,   0.707, 1.581]
 [0, 0.353, 0.353, 0.353, 0.353, 0.707, 1.904]
 [0, 0.353, 0.5,   0.5,   0.707, 0.791, 1.803]
 [0, 0.353, 0.5,   0.5,   0.707, 0.791, 2.062]
 [0, 0.353, 0.707, 0.791, 0.791, 1.061, 1.275]
 [0, 1.273, 1.581, 1.803, 1.904, 2.062, 2.236]]  
 
 The sorted kth (k=4) Column (5th column)  
 [1.904,0.791,0.707,0.707,0.707,0.5,0.353]  

Now plotting this will yield

plt.plot([0,1,2,3,4,5,6],[1.904,0.791,0.707,0.707,0.707,0.5,0.353]) 

enter image description here

As you can see there is a strong bend at 0.79. So for k/minpts = 4, any point which have kth neighbour distance > 0.79 will be considered noise/non-core point.

Of course there's no guarantee that there will be a strong bend or even a bend at all in the graph it completely depends on the distribution of data
The original paper

Coates answered 21/12, 2019 at 20:43 Comment(2)
Link dead, working alternative from aaai.org: new.aaai.org/Papers/KDD/1996/KDD96-037.pdfContrary
@FilippoVitale Chrome is showing not secure for some reasons, however I found another alternative file.biolab.si/papers/1996-DBSCAN-KDD.pdfCoates

© 2022 - 2024 — McMap. All rights reserved.