DBSCAN for clustering of geographic location data

H

5

33

I have a dataframe with latitude and longitude pairs.

Here is my dataframe look like.

    order_lat  order_long
0   19.111841   72.910729
1   19.111342   72.908387
2   19.111342   72.908387
3   19.137815   72.914085
4   19.119677   72.905081
5   19.119677   72.905081
6   19.119677   72.905081
7   19.120217   72.907121
8   19.120217   72.907121
9   19.119677   72.905081
10  19.119677   72.905081
11  19.119677   72.905081
12  19.111860   72.911346
13  19.111860   72.911346
14  19.119677   72.905081
15  19.119677   72.905081
16  19.119677   72.905081
17  19.137815   72.914085
18  19.115380   72.909144
19  19.115380   72.909144
20  19.116168   72.909573
21  19.119677   72.905081
22  19.137815   72.914085
23  19.137815   72.914085
24  19.112955   72.910102
25  19.112955   72.910102
26  19.112955   72.910102
27  19.119677   72.905081
28  19.119677   72.905081
29  19.115380   72.909144
30  19.119677   72.905081
31  19.119677   72.905081
32  19.119677   72.905081
33  19.119677   72.905081
34  19.119677   72.905081
35  19.111860   72.911346
36  19.111841   72.910729
37  19.131674   72.918510
38  19.119677   72.905081
39  19.111860   72.911346
40  19.111860   72.911346
41  19.111841   72.910729
42  19.111841   72.910729
43  19.111841   72.910729
44  19.115380   72.909144
45  19.116625   72.909185
46  19.115671   72.908985
47  19.119677   72.905081
48  19.119677   72.905081
49  19.119677   72.905081
50  19.116183   72.909646
51  19.113827   72.893833
52  19.119677   72.905081
53  19.114100   72.894985
54  19.107491   72.901760
55  19.119677   72.905081

I want to cluster this points which are nearest to each other(200 meters distance) following is my distance matrix.

from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,
     1.05925366,  1.05420922],
   [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
     0.81742536,  0.98978355],
   [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
     0.81742536,  0.98978355],
   ..., 
   [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,
     1.02310118,  1.22871515],
   [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,
     0.        ,  1.39923529],
   [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,
     1.39923529,  0.        ]])

Then I am applying DBSCAN clustering algorithm on distance matrix.

 from sklearn.cluster import DBSCAN

 db = DBSCAN(eps=2,min_samples=5)
 y_db = db.fit_predict(distance_matrix)

I don't know how to choose eps & min_samples value. It clusters the points which are way too far, in one cluster.(approx 2 km in distance) Is it because it calculates euclidean distance while clustering? please help.

Heresy answered 3/1, 2016 at 17:9 Comment(6)

Note that DBSCAN does not bound the pairwise distances in a cluster. It joins sets of radius epsilon transitively, which means there is no useful upper limit to the maximum distance (eps+eps+eps+eps+eps+... every join increases the maximum by eps, so the maximum distance is (numCorePointsInCluster+1)*epsilon). It is a design intention of the algorithm to allow this to happen. – Filipino 18/11, 2016 at 7:47

@Anony-Mousse Is it possible to limit the cluster size to a max, using the available DBSCAN options? – Witenagemot 23/5, 2018 at 13:41

No. When everything is connected, everything is one single cluster by definition. And it should be, by the concept of clustering: similar things should be in the same cluster, no matter how many. If you are more interested in controlling the size of the cluster, you probably are more into a quantization method instead. – Filipino 23/5, 2018 at 18:26

Hi, thank you for the question, and I am also keen to know what is the unit for the epsilon? For example, eps=2, does it mean 2km? or 200m? – Dowsabel 6/12, 2021 at 16:24

Just came to this question. 2 additions - @YeoKeat - eps units are the same as the units entered. Since you're using latitude & longitude this distance changes where you are on the globe and in the direction you travel. If you use a locally square coordinate system like ums then it will be km or whatever unit you enter. – Eboni 20/9, 2023 at 19:43

2nd addition - OPTICS is a density based slustering technique and has a feature 'max_eps' that can be used to set an effective maximum size – Eboni 20/9, 2023 at 19:45

F

16

DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java) - scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors). But apparently, you can affort to precompute pairwise distances, so this is not (yet) an issue.

However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)

uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.

See the documentation of DBSCAN (emphasis added):

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

similar for fit_predict:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

In other words, you need to do

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")

Filipino answered 3/1, 2016 at 18:11 Comment(6)

That was really helpful. I am working on a project called online food ordering application,where I have to cluster order locations in real time for route optimization. Is DBSCAN good approach for this kind of problem? – Heresy 3/1, 2016 at 18:27

I'd use something that knows e.g. about one-way streets (or streets in general). I doubt clustering helps much, but there are specific algorithms for route optimization. Although a simple greedy approach may be the way to go if you need it to be fast. – Filipino 3/1, 2016 at 18:37

Thanks for the help. – Heresy 3/1, 2016 at 18:40

Heyy @Anony-Mousse I realized your above comment and I have a question for you. I have a data from vehicles gps on one motorway and one busway that close motorway. I need only use motorwat cars data so can I find with DBSCAN algorith which vehicle are buses then for remove motorway data? – Gothar 3/5, 2017 at 4:51

That isn't a clustering task, but.preprocessing. – Filipino 3/5, 2017 at 6:23

This answer does not reply to the original question: "how to choose eps & min_samples value". Also "DBSCAN is meant to be used on the raw data" is not true it depends on the application and the type of metric used – Cactus 6/10, 2020 at 11:48

G

70

You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))

This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2km, but it's divided by 6371 to convert it to radians. Also, notice that .fit() takes the coordinates in radian units for the haversine metric.

Girardi answered 2/8, 2016 at 23:0 Comment(0)

F

16

DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java) - scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors). But apparently, you can affort to precompute pairwise distances, so this is not (yet) an issue.

However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)

uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.

See the documentation of DBSCAN (emphasis added):

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

similar for fit_predict:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

In other words, you need to do

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")

Filipino answered 3/1, 2016 at 18:11 Comment(6)

That was really helpful. I am working on a project called online food ordering application,where I have to cluster order locations in real time for route optimization. Is DBSCAN good approach for this kind of problem? – Heresy 3/1, 2016 at 18:27

I'd use something that knows e.g. about one-way streets (or streets in general). I doubt clustering helps much, but there are specific algorithms for route optimization. Although a simple greedy approach may be the way to go if you need it to be fast. – Filipino 3/1, 2016 at 18:37

Thanks for the help. – Heresy 3/1, 2016 at 18:40

Heyy @Anony-Mousse I realized your above comment and I have a question for you. I have a data from vehicles gps on one motorway and one busway that close motorway. I need only use motorwat cars data so can I find with DBSCAN algorith which vehicle are buses then for remove motorway data? – Gothar 3/5, 2017 at 4:51

That isn't a clustering task, but.preprocessing. – Filipino 3/5, 2017 at 6:23

This answer does not reply to the original question: "how to choose eps & min_samples value". Also "DBSCAN is meant to be used on the raw data" is not true it depends on the application and the type of metric used – Cactus 6/10, 2020 at 11:48

C

8

I don't know what implementation of haversine you're using but it looks like it returns results in km so eps should be 0.2, not 2 for 200 m.

For the min_samples parameter, that depends on what your expected output is. Here are a couple of examples. My outputs are using an implementation of haversine based on this answer which gives a distance matrix similar, but not identical to yours.

This is with db = DBSCAN(eps=0.2, min_samples=5)

[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]

This creates three clusters, 0, 1 and 2, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1).

Trying again with a smaller min_samples value:

db = DBSCAN(eps=0.2, min_samples=2)

[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]

Here most of the samples are within 200m of at least one other sample and so fall into one of eight clusters 0 to 7.

Edited to add

It looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:

from math import radians, cos, sin, asin, sqrt

from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

import matplotlib.pyplot as plt
import pandas as pd


def haversine(lonlat1, lonlat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r


X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

X['cluster'] = y_db

plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()

Canker answered 3/1, 2016 at 17:42 Comment(12)

Yup, I am using the same implementation of haversine. When I use 0.2 it still clusters point which are way too far from each other. – Heresy 3/1, 2016 at 17:51

When you say it clusters points that are too far away from each other, do you mean far from the nearest point in the cluster or from the farthest point in the cluster? – Canker 3/1, 2016 at 17:53

How can I know the boundary of the cluster? I am new to clustering. The Point I am trying to make is, distance between two points is more than 2 km but still it includes in a cluster. – Heresy 3/1, 2016 at 17:58

Can you give me an example? I'm not seeing that in my results. Unless you're thinking of -1 as a cluster? – Canker 3/1, 2016 at 18:15

is -1 noise? I was considering it as a cluster – Heresy 3/1, 2016 at 18:21

Yes, -1 is anything not assigned to a cluster. – Canker 3/1, 2016 at 18:22

ok. what if I want to remove the noise? Do I have to increase eps parameter ? – Heresy 3/1, 2016 at 18:28

As you can see in my second example, if you reduce the min_samples parameter you will get more clusters since the minimum number of members requirement is lower, and so there will be fewer locations left unassigned. If you increase the eps parameter then you will get fewer clusters with more members. It's up to you which is more useful for your purposes. – Canker 3/1, 2016 at 18:31

ok. understood. problem with my use case is order locations are not known in advance. I have to cluster it in real time. so, that route optimization can be performed. – Heresy 3/1, 2016 at 18:37

Let us continue this discussion in chat. – Canker 3/1, 2016 at 18:47

@Heresy DBSCAN does not guarantee a maximum distance of any two points. Clusters can be much larger than the core radius epsilon by intention. – Filipino 18/11, 2016 at 7:49

Hi, I am also keen to know what is the unit for the epsilon? For example, eps=2, does it mean 2km? or 200m? – Dowsabel 6/12, 2021 at 16:24

C

1

@eos Gives the best answer I think - as well as making use of Haversine distance (the most relevant distance measure in this case), it avoids the need to generate a precomputed distance matrix. If you create a distance matrix then you need to calculate the pairwise distances for every combination of points (although you can obviously save a bit of time by taking advantage of the fact that your distance metric is symmetric).

If you just supply DBSCAN with a distance measure and use the ball_tree algorithm though, it can avoid the need to calculate every possible distance. This is because the ball tree algorithm can use the triangular inequality theorem to reduce the number of candidates that need to be checked to find the nearest neighbours of a data point (this is the biggest job in DBSCAN).

The triangular inequality theorem states:

|x+y| <= |x| + |y|

...so if a point p is distance x from its neighbour n, and another point q is a distance y from p, if x+y is greater than our nearest neighbour radius, we know that q must be too far away from n to be considered a neighbour, so we don't need to calculate its distance.

Read more about how ball trees work in the scikit-learn documentation

Cap answered 22/9, 2019 at 22:45 Comment(0)

C

1

There are three different things you can do to use DBSCAN with GPS data. The first is that you can use the eps parameter to specify the maximum distance between data points that you will consider to create a cluster, as specified in other answers you need to take into account the scale of the distance metric you are using a pick a value that makes sense. Then you can use the min_samples this can be used as a way to filtering out data points while moving. Last the metric will allow you to use whatever distance you want.

As an example, in a particular research project I'm working on I want to extract significant locations from a subject's GPS data locations collected from their smartphone. I'm not interested on how the subject navigates through the city and also I'm more comfortable dealing with distances in meters then I can do the next:

from geopy import distance
def mydist(p1, p2):
     return distance.great_circle((p1[0],p1[1],100),(p2[0],p2[1],100)).meters
DBSCAN(eps=50,min_samples=50,n_jobs=-1,metric=mydist)

Here eps as per the DBSCAN documentation "The maximum distance between two samples for one to be considered as in the neighborhood of the other." While min samples is "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point." Basically with eps you control how close data points in a cluster should be, in the example above I selected 100 meters. Min samples is just a way to control for density, in the example above the data was captured at about one sample per second, because I'm not interested in when people are moving around but instead stationary locations I want to make sure I get at least the equivalent of 60 seconds of GPS data from the same location.

If this still does not make sense take a look at this DBSCAN animation.

Cactus answered 6/10, 2020 at 22:21 Comment(0)

Recommended topics

Hot tags