This is the output of a careful density-based clustering using the fairly new HDBSCAN* algorithm, with Haversine distance instead of Euclidean!
It identified roughly 50 regions that are substantially denser than their surroundings. In this figure, some clusters look as if they had only 3 elements, but they actually contain many more.
The outermost area consists of the noise points, which do not belong to any cluster at all!
(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction -algorithm SLINKHDBSCANLinearMemory -algorithm.distancefunction geo.LatLngDistanceFunction -hdbscan.minPts 20 -hdbscan.minclsize 20)
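To make the distance choice concrete: Haversine gives the great-circle distance on a sphere, which is what you want for latitude/longitude data (Euclidean distance on raw degrees distorts badly away from the equator). A minimal sketch in plain Python (the function name and Earth radius are my own choices, not ELKI's implementation):

```python
import math

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    # Haversine formula: a is the squared half-chord length between the points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```

For example, Paris to London comes out around 340 km, while treating the coordinate differences as Euclidean would give a meaningless number in "degrees".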
OPTICS is another density-based algorithm; here is a result:
Again, we have a "noise" area: the red dots are not dense at all.
Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.1 -algorithm.distancefunction geo.LatLngDistanceFunction -optics.minpts 25
The OPTICS plot for this data set looks like this:
You can see that there are many small valleys that correspond to clusters, but there is no "large" structure here.
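To make the "valleys" concrete: OPTICS orders the points so that each one follows a nearby dense predecessor, recording a reachability distance; low stretches of that curve are the valleys (clusters), and spikes mark transitions into sparse areas. A simplified sketch on 1-D toy data (no eps cutoff, hypothetical helper names, not ELKI's implementation):

```python
import heapq
import math

def optics_order(points, min_pts):
    """Simplified OPTICS: returns (ordering, reachability) for 1-D points."""
    n = len(points)
    dist = lambda i, j: abs(points[i] - points[j])  # toy 1-D metric
    # core distance = distance to the min_pts-th nearest neighbour (self excluded)
    core = []
    for i in range(n):
        d = sorted(dist(i, j) for j in range(n) if j != i)
        core.append(d[min_pts - 1])
    processed = [False] * n
    ordering, reach = [], []
    best = {}
    for start in range(n):
        if processed[start]:
            continue
        seeds = [(math.inf, start)]  # priority queue of (reachability, point)
        while seeds:
            r, p = heapq.heappop(seeds)
            if processed[p]:
                continue  # stale queue entry
            processed[p] = True
            ordering.append(p)
            reach.append(r)
            # relax reachability of all unprocessed points via p
            for o in range(n):
                if not processed[o]:
                    new_r = max(core[p], dist(p, o))
                    if new_r < best.get(o, math.inf):
                        best[o] = new_r
                        heapq.heappush(seeds, (new_r, o))
    return ordering, reach
```

On two tight groups such as [0.0, 0.1, 0.2] and [10.0, 10.1, 10.2], the reachability values stay tiny inside each group and spike at the jump between them: exactly the valley/wall shape in the plot above.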
You probably were looking for a result like this:
But in fact, this is a meaningless and rather arbitrary way of breaking the data into large chunks. Sure, it minimizes variance; but it does not care about the structure of the data at all. Points within one cluster will frequently have less in common than points in different clusters. Just look at the points at the border between the red, orange, and violet clusters.
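The border effect is easy to reproduce with a minimal sketch of Lloyd's k-means (the kind of variance-minimizing partition shown above; toy 1-D data and hypothetical function names): on a uniform stretch of points, neighbours a tiny distance apart land in different clusters, while each cluster spans points far apart.

```python
def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

def kmeans_1d(points, centers, iters=100):
    """Minimal Lloyd's k-means on 1-D data (toy sketch)."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in points:
            clusters[nearest(x, centers)].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # assignments stable -> converged
            break
        centers = new_centers
    labels = [nearest(x, centers) for x in points]
    return centers, labels
```

Running this with k=3 on 30 evenly spaced points slices the line into three chunks: the closest pair of points in *different* clusters sits right at a cut, far closer together than the endpoints of any single cluster.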
Last but not least, an old-timer: hierarchical clustering with complete linkage:
and the dendrogram:
(Parameters used: -verbose -dbc.in file.csv -parser.labelIndices 0,1 -algorithm clustering.hierarchical.extraction.SimplifiedHierarchyExtraction -algorithm AnderbergHierarchicalClustering -algorithm.distancefunction geo.LatLngDistanceFunction -hierarchical.linkage CompleteLinkageMethod -hdbscan.minclsize 50)
Not too bad. Complete linkage works rather well on such data, too. But you could just as easily merge or split any of these clusters by cutting the dendrogram at a different height.
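For intuition, complete linkage repeatedly merges the two clusters whose *farthest* pair of members is closest, so it favors compact clusters. A naive O(n³) sketch on 1-D toy data (hypothetical names, not the Anderberg algorithm ELKI uses):

```python
def complete_linkage(points, k):
    """Naive agglomerative clustering with complete linkage, down to k clusters."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # complete linkage = distance between the two FARTHEST members
        return max(abs(x - y) for x in a for y in b)  # toy 1-D metric

    while len(clusters) > k:
        # find the pair of clusters with the smallest complete-linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Stopping at k=2 on two well-separated groups recovers them exactly; choosing a different k (i.e., cutting the dendrogram elsewhere) would merge or split them, which is the arbitrariness mentioned above.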
lng column twice. I updated the file. I don't know why the extra column X appears. – Clarice

plot(lat ~ lng, data = data, col = dbscan(data[, c("lat", "lng")], eps = 0.004, minPts = 3)$cluster + 1L, pch = 20). However, your ellipses suggest that you want to cluster dense and non-dense parts together. I don't see how this could work. A decision tree would be good for a supervised classification task, but your data is not labeled. Maybe someone has an idea. Otherwise I'd try my luck on stats.stackexchange.com, too. – Stocktaking