DBSCAN for clustering data by location and density

I'm using the method dbscan::dbscan in order to cluster my data by location and density.

My data looks like this:

str(data)
'data.frame': 4872 obs. of 3 variables:
 $ price    : num ...
 $ lat      : num ...
 $ lng      : num ...

Now I'm using the following code:

EPS = 7
cluster.dbscan <- dbscan(data, eps = EPS, minPts = 30, borderPoints = T,
                         search = "kdtree")
plot(lat ~ lng, data = data, col = cluster.dbscan$cluster + 1L, pch = 20)

but the result isn't satisfying at all; the points aren't really clustered.

[figure: plot of the DBSCAN result produced by the code above]

I would like to have the clusters nicely defined, something like this:

[figure: the desired clusters, sketched as ellipses on the map]

I also tried to use the decision tree classifier tree::tree, which works better, but I can't tell whether it is really a good classification.

File:

http://www.file-upload.net/download-11246655/file.csv.html

Question:

  • Is it possible to achieve what I want?
  • Am I using the right method?
  • Should I play more with the parameters? If yes, which ones?
Clariceclarie answered 25/1, 2016 at 11:54 Comment(7)
You should provide the data set if possible, or a toy data set.Stocktaking
@Stocktaking I just added the dataset.Clariceclarie
I don't think the data in the file represents your data set. Please have a look.Stocktaking
@lukeA: yes, you're right. I mistakenly added the lng column twice. I updated the file. I don't know why the extra column X appears.Clariceclarie
Thanks. The price attribute is not used? Well, if you want to cluster by location and density, dbscan would be a good choice. It puts together dense parts, e.g. plot(lat ~ lng, data = data, col = dbscan(data[, c("lat", "lng")], eps = 0.004, minPts = 3)$cluster + 1L, pch = 20). However, your ellipses suggest that you want to cluster dense and non-dense parts together. I don't see how this could work. A decision tree would be good for a supervised classification task, but your data is not labeled. Maybe someone has an idea. Otherwise I'd try my luck on stats.stackexchange.com, too.Stocktaking
What you sketched are not density-based clusters. So if that is what you want, you will need a different algorithm (but from what you sketched, I'd say it is statistically not reasonable but rather random).Football
@Stocktaking I used the price in the decision tree classification algorithm. After putting some thought into what I need, I realized that I need to cluster by location and price, not location and density.Clariceclarie

This is the output of a careful density-based clustering using the quite new HDBSCAN* algorithm.

Using Haversine distance, instead of Euclidean!

It identified some 50-something regions that are substantially more dense than their surroundings. In this figure, some clusters look as if they had only 3 elements, but they do have many more.

[figure: map of the HDBSCAN* clusters]

The outermost area contains the noise points, which do not belong to any cluster at all!

(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction -algorithm SLINKHDBSCANLinearMemory -algorithm.distancefunction geo.LatLngDistanceFunction -hdbscan.minPts 20 -hdbscan.minclsize 20)
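
Those are ELKI parameters (see the comments below). If you want to stay in R, a rough analogue is possible with the dbscan and geosphere packages by feeding a Haversine distance matrix to hdbscan(). This is only a sketch under assumptions (lng/lat column names as in the question, minPts = 20 as above, no separate minimum cluster size), not the run that produced the figure:

library(dbscan)     # hdbscan()
library(geosphere)  # distm(), distHaversine

# Pairwise great-circle distances in metres; for ~4900 points this is a
# roughly 190 MB matrix, so it fits in memory but is not cheap.
d <- as.dist(distm(as.matrix(data[, c("lng", "lat")]), fun = distHaversine))

hdb <- hdbscan(d, minPts = 20)
plot(lat ~ lng, data = data, col = hdb$cluster + 1L, pch = 20)  # cluster 0 = noise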

OPTICS is another density-based algorithm, here is a result:

[figure: map of the OPTICS clusters]

Again, we have a "noise" area whose red dots are not dense at all.

Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.1 -algorithm.distancefunction geo.LatLngDistanceFunction -optics.minpts 25
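
Again, these are ELKI parameters. A rough R sketch of the same idea (Haversine distances, minPts = 25, Xi = 0.1 extraction), assuming the dbscan and geosphere packages, could look like this:

library(dbscan)
library(geosphere)

d   <- as.dist(distm(as.matrix(data[, c("lng", "lat")]), fun = distHaversine))
opt <- optics(d, minPts = 25)      # compute the ordering and reachability values
plot(opt)                          # reachability plot (valleys correspond to clusters)
opt <- extractXi(opt, xi = 0.1)    # Xi-based cluster extraction
plot(lat ~ lng, data = data, col = opt$cluster + 1L, pch = 20)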

The OPTICS plot for this data set looks like this:

[figure: OPTICS reachability plot]

You can see there are many small valleys that correspond to clusters. But there is no "large" structure here.

You probably were looking for a result like this:

[figure: the map partitioned into a few large chunks]

But in fact, this is a meaningless and rather random way of breaking the data into large chunks. Sure, it minimizes variance; but it does not at all care about the structure of the data. Points within one cluster will frequently have less in common than points in different clusters. Just look at the points at the border between the red, orange, and violet clusters.
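
For illustration, that kind of variance-minimizing partition is what k-means produces (assuming that is what generated the figure; the number of clusters there is a guess):

set.seed(1)
km <- kmeans(data[, c("lng", "lat")], centers = 7, nstart = 20)  # k = 7 is arbitrary
plot(lat ~ lng, data = data, col = km$cluster + 1L, pch = 20)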

Last but not least, one of the old-timers: hierarchical clustering with complete linkage:

[figure: map of the complete-linkage clusters]

and the dendrogram:

[figure: dendrogram of the complete-linkage clustering]

(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.SimplifiedHierarchyExtraction -algorithm AnderbergHierarchicalClustering -algorithm.distancefunction geo.LatLngDistanceFunction -hierarchical.linkage CompleteLinkageMethod -hdbscan.minclsize 50)

Not too bad. Complete linkage works on such data rather well, too. But you could merge or split any of these clusters.
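
A roughly comparable run in plain R (a sketch; base hclust has no equivalent of ELKI's simplified hierarchy extraction, so the number of clusters to cut at is a guess):

library(geosphere)

d  <- as.dist(distm(as.matrix(data[, c("lng", "lat")]), fun = distHaversine))
hc <- hclust(d, method = "complete")   # complete-linkage hierarchical clustering
plot(hc, labels = FALSE)               # dendrogram
cl <- cutree(hc, k = 20)               # cut into 20 clusters (arbitrary choice)
plot(lat ~ lng, data = data, col = cl + 1L, pch = 20)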

Football answered 26/1, 2016 at 0:40 Comment(5)
Thank you very much for your response and explanation. The hierarchical clustering is the closest to what I wanted. I've put some thought into it, and what I need is to create clusters based on location and price. The result of the hierarchical clustering is very plausible to me.Clariceclarie
Could you tell us which implementation you used to produce those results? Is it implemented in R?Bumper
I used ELKI for these plots, because it supports Haversine distance very well.Football
@Anony-Mousse I tried the parameters you used in the ELKI GUI and it worked perfectly, but when using it directly in Java, it returns a single cluster.Clariceclarie
The Java API is tricky to use, because of default values. There are so many parameters in ELKI, and the MiniGUI and command line interfaces do a good job at setting all the defaults. If possible, I use a shell script and the command line interface.Football

You can use a hull plot (hullplot from the dbscan package), which draws the convex hull of each cluster. In your case:

library(dbscan); library(dplyr)  # hullplot() and select()
hullplot(select(data, lng, lat), cluster.dbscan$cluster)

Distraction answered 6/12, 2017 at 17:51 Comment(0)
