DBSCAN for clustering data by location and density

I'm using the method dbscan::dbscan in order to cluster my data by location and density.

My data looks like this:

str(data)
'data.frame': 4872 obs. of 3 variables:
 $ price    : num ...
 $ lat      : num ...
 $ lng      : num ...

Now I'm using the following code:

EPS = 7
cluster.dbscan <- dbscan(data, eps = EPS, minPts = 30, borderPoints = T,
                         search = "kdtree")
plot(lat ~ lng, data = data, col = cluster.dbscan$cluster + 1L, pch = 20)

but the result isn't satisfying at all; the points aren't really clustered.

[figure: plot of the DBSCAN result produced by the code above]

I would like to have the clusters nicely defined, something like this:

[figure: the desired clusters, sketched as ellipses on the map]

I also tried to use the decision tree classifier tree::tree, which works better, but I can't tell whether it is really a good classification.

File:

http://www.file-upload.net/download-11246655/file.csv.html

Question:

  • Is it possible to achieve what I want?
  • Am I using the right method?
  • Should I play more with the parameters? If yes, which ones?
Clariceclarie answered 25/1, 2016 at 11:54 Comment(7)
You should provide the data set if possible, or a toy data set.Stocktaking
@Stocktaking I just added the dataset.Clariceclarie
I don't think the data in the file represents your data set. Please have a look.Stocktaking
@lukeA: yes, you're right. I mistakenly added the lng column twice. I updated the file. I don't know why the extra column X appears.Clariceclarie
Thanks. The price attribute is not used? Well, if you want to cluster by location and density, dbscan would be a good choice. It puts together dense parts, e.g. plot(lat ~ lng, data = data, col = dbscan(data[, c("lat", "lng")], eps = 0.004, minPts = 3)$cluster + 1L, pch = 20). However, your ellipses suggest that you want to cluster dense and non-dense parts together. I don't see how this could work. A decision tree would be good for a supervised classification task, but your data is not labeled. Maybe someone has an idea. Otherwise I'd try my luck on stats.stackexchange.com, too.Stocktaking
What you sketched are not density-based clusters. So if that is what you want, you will need a different algorithm (but from what you sketched, I'd say it is statistically not reasonable but rather random).Football
@Stocktaking I used the price in the decision tree classification algorithm. After putting some thought into what I need, I realized that I need to cluster by location and price, not location and density.Clariceclarie

This is the output of a careful density-based clustering using the quite new HDBSCAN* algorithm.

Using Haversine distance, instead of Euclidean!

It identified some 50-something regions that are substantially more dense than their surroundings. In this figure, some clusters look as if they had only 3 elements, but they do have many more.

[figure: map of the HDBSCAN* clusters]

The outermost area contains the noise points, which do not belong to any cluster at all!

(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction -algorithm SLINKHDBSCANLinearMemory -algorithm.distancefunction geo.LatLngDistanceFunction -hdbscan.minPts 20 -hdbscan.minclsize 20)
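
Those are ELKI parameters (see the comments below). If you want to stay in R, a rough analogue is possible with the dbscan and geosphere packages by feeding a Haversine distance matrix to hdbscan(). This is only a sketch under assumptions (lng/lat column names as in the question, minPts = 20 as above, no separate minimum cluster size), not the run that produced the figure:

library(dbscan)     # hdbscan()
library(geosphere)  # distm(), distHaversine

# Pairwise great-circle distances in metres; for ~4900 points this is a
# roughly 190 MB matrix, so it fits in memory but is not cheap.
d <- as.dist(distm(as.matrix(data[, c("lng", "lat")]), fun = distHaversine))

hdb <- hdbscan(d, minPts = 20)
plot(lat ~ lng, data = data, col = hdb$cluster + 1L, pch = 20)  # cluster 0 = noise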

OPTICS is another density-based algorithm, here is a result:

[figure: map of the OPTICS clusters]

Again, we have a "noise" area whose red dots are not dense at all.

Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.1 -algorithm.distancefunction geo.LatLngDistanceFunction -optics.minpts 25
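
Again, these are ELKI parameters. A rough R sketch of the same idea (Haversine distances, minPts = 25, Xi = 0.1 extraction), assuming the dbscan and geosphere packages, could look like this:

library(dbscan)
library(geosphere)

d   <- as.dist(distm(as.matrix(data[, c("lng", "lat")]), fun = distHaversine))
opt <- optics(d, minPts = 25)      # compute the ordering and reachability values
plot(opt)                          # reachability plot (valleys correspond to clusters)
opt <- extractXi(opt, xi = 0.1)    # Xi-based cluster extraction
plot(lat ~ lng, data = data, col = opt$cluster + 1L, pch = 20)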

The OPTICS plot for this data set looks like this:

[figure: OPTICS reachability plot]

You can see there are many small valleys that correspond to clusters. But there is no "large" structure here.

You probably were looking for a result like this:

[figure: the map partitioned into a few large chunks]

But in fact, this is a meaningless and rather random way of breaking the data into large chunks. Sure, it minimizes variance; but it does not at all care about the structure of the data. Points within one cluster will frequently have less in common than points in different clusters. Just look at the points at the border between the red, orange, and violet clusters.
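
For illustration, that kind of variance-minimizing partition is what k-means produces (assuming that is what generated the figure; the number of clusters there is a guess):

set.seed(1)
km <- kmeans(data[, c("lng", "lat")], centers = 7, nstart = 20)  # k = 7 is arbitrary
plot(lat ~ lng, data = data, col = km$cluster + 1L, pch = 20)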

Last but not least, one of the old-timers: hierarchical clustering with complete linkage:

[figure: map of the complete-linkage clusters]

and the dendrogram:

[figure: dendrogram of the complete-linkage clustering]

(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.SimplifiedHierarchyExtraction -algorithm AnderbergHierarchicalClustering -algorithm.distancefunction geo.LatLngDistanceFunction -hierarchical.linkage CompleteLinkageMethod -hdbscan.minclsize 50)

Not too bad. Complete linkage works on such data rather well, too. But you could merge or split any of these clusters.
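
A roughly comparable run in plain R (a sketch; base hclust has no equivalent of ELKI's simplified hierarchy extraction, so the number of clusters to cut at is a guess):

library(geosphere)

d  <- as.dist(distm(as.matrix(data[, c("lng", "lat")]), fun = distHaversine))
hc <- hclust(d, method = "complete")   # complete-linkage hierarchical clustering
plot(hc, labels = FALSE)               # dendrogram
cl <- cutree(hc, k = 20)               # cut into 20 clusters (arbitrary choice)
plot(lat ~ lng, data = data, col = cl + 1L, pch = 20)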

Football answered 26/1, 2016 at 0:40 Comment(5)
Thank you very much for your response and explanation. The hierarchical clustering is the closest to what I wanted. I've put some thought into it, and what I need is to create clusters based on location and price. The result of the hierarchical clustering is very plausible to me.Clariceclarie
Could you tell us which implementation you used to produce those results? Is it implemented in R?Bumper
I used ELKI for these plots, because it supports Haversine distance very well.Football
@Anony-Mousse I tried the parameters you used in the ELKI GUI and it worked perfectly, but when using it directly in Java, it returns a single cluster.Clariceclarie
The Java API is tricky to use, because of default values. There are so many parameters in ELKI, and the MiniGUI and command line interfaces do a good job at setting all the defaults. If possible, I use a shell script and the command line interface.Football

You can use a hull plot (hullplot from the dbscan package), which draws the convex hull of each cluster. In your case:

library(dbscan); library(dplyr)  # hullplot() and select()
hullplot(select(data, lng, lat), cluster.dbscan$cluster)

Distraction answered 6/12, 2017 at 17:51 Comment(0)
