DBSCAN - setting a limit on maximum cluster span

By my understanding of DBSCAN, it's possible to specify an epsilon of, say, 100 meters and, because DBSCAN considers density-reachability and not direct density-reachability when finding clusters, end up with a cluster in which the maximum distance between any two points is greater than 100 meters. In a more extreme case, it seems possible to set an epsilon of 100 meters and end up with a cluster spanning 1 kilometer: see [2][6] in this array of images from scikit-learn for an example of when that might occur. (I'm more than willing to be told I'm misunderstanding DBSCAN if that's what's happening here.)
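
To illustrate the chaining effect concretely, here is a small sketch (the point spacing and eps values are made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Points spaced 0.9 apart along a line: every point is within eps=1.0 of
# its neighbour, so density-reachability chains all of them into a single
# cluster whose total span is nearly ten times eps.
X = np.arange(0, 10, 0.9).reshape(-1, 1)
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

print(len(set(labels)))          # → 1  (a single cluster, no noise)
print(-1 in labels)              # → False (no outliers)
print(float(X.max() - X.min()))  # span far exceeds eps=1.0
```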

Is there an algorithm that is density-based like DBSCAN but takes into account some kind of thresholding for the maximum distance between any two points in a cluster?

Copp answered 31/8, 2013 at 10:29

DBSCAN indeed does not impose a total size constraint on the cluster.

The epsilon value is best interpreted as the size of the gap separating two clusters (that may at most contain minpts-1 objects).

I believe you are in fact not even looking for clustering: clustering is the task of discovering structure in data. The structure can be simple (such as the partitions found by k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering and DBSCAN).

You might be looking for vector quantization - reducing a data set to a smaller set of representatives - or set cover - finding the optimal cover for a given set - instead.
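
For the vector-quantization reading of the problem, a minimal sketch using scikit-learn's KMeans as the quantizer (the dataset and codebook size here are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Quantize 1000 points down to a codebook of 16 representatives; each
# point is then encoded as the index of its nearest codebook entry.
rng = np.random.default_rng(0)
X = rng.random((1000, 2))

km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(X)
codebook = km.cluster_centers_   # the 16 representatives
codes = km.predict(X)            # per-point index into the codebook
```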

However, I also have the impression that you aren't really sure on what you need and why.

A strength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.

Direct density-reachability, however, doesn't define a useful (partitioning) structure: it simply does not partition the data into disjoint subsets.

If you don't need this kind of strong structure (i.e. you aren't doing clustering as in "structure discovery", but just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way; instead, it will depend heavily on the ordering of your data. If you then process the result appropriately, it can still be useful. Unless you are in a distributed setting, however, canopy preclustering is at least as expensive as a full DBSCAN run. Due to its loose requirements (in particular, "canopies" may overlap, and objects are expected to belong to multiple "canopies"), it is easier to parallelize.
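
A naive sketch of the canopy idea (an illustrative toy implementation, not a library call; the two thresholds `t1 > t2` play the role of the two epsilon values):

```python
import numpy as np

def canopy(points, t1, t2, seed=0):
    """Toy canopy preclustering; assumes t1 > t2.

    Each pass promotes a remaining point to a canopy centre, assigns
    every point within t1 to that canopy, and drops points within t2
    from further consideration.  Canopies may overlap, and the result
    depends on the (here randomised) processing order -- exactly the
    caveats mentioned above.
    """
    rng = np.random.default_rng(seed)
    remaining = list(rng.permutation(len(points)))
    canopies = []
    while remaining:
        centre = remaining.pop(0)
        d = np.linalg.norm(points - points[centre], axis=1)
        canopies.append(np.flatnonzero(d < t1))   # may overlap other canopies
        remaining = [i for i in remaining if d[i] >= t2]
    return canopies

# Two tight blobs far apart -> two canopies, regardless of ordering.
blobs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(len(canopy(blobs, t1=1.0, t2=0.5)))  # → 2
```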

Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, each resulting cluster is guaranteed to have at most that distance between any two of its objects. The only problem is that hierarchical clustering is usually O(n^3), i.e. it doesn't scale to large data sets, whereas DBSCAN runs in O(n log n) in good implementations (with index support).
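
A sketch of the dendrogram cut with SciPy's hierarchical-clustering API (`linkage`/`fcluster`; the dataset and cut height 0.5 are arbitrary choices for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((40, 2))

# Complete linkage merges clusters by their *largest* pairwise distance,
# so cutting the dendrogram at height 0.5 guarantees that no flat cluster
# contains two points farther apart than 0.5.
Z = linkage(X, method='complete')
labels = fcluster(Z, t=0.5, criterion='distance')

# Verify the maximum-span guarantee for every cluster.
for k in np.unique(labels):
    assert pdist(X[labels == k]).max(initial=0) <= 0.5
```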

Tophet answered 31/8, 2013 at 21:47

I had the same problem and ended up solving it by combining DBSCAN with k-means clustering: first I use DBSCAN to identify high-density clusters and remove outliers, then I take any cluster spanning more than 250 miles (in my case) and break it apart. Here's the code:

from sklearn.cluster import DBSCAN, KMeans  # KMeans is needed for the split step below

clustering = DBSCAN(eps=0.3, min_samples=100).fit(load_geocodes[['lat', 'long']])
load_geocodes.loc[:, 'cluster'] = clustering.labels_

import mpu  # third-party package; its haversine_distance returns kilometers

def calculate_cluster_size(lat, long):
    # Diagonal of the cluster's bounding box, converted from km to miles
    left_top = (max(lat), min(long))
    right_bottom = (min(lat), max(long))
    distance = mpu.haversine_distance(left_top, right_bottom) * 0.621371
    return distance

for c, df in load_geocodes.groupby('cluster'):
    if c == -1:
        continue # don't do this for outliers
    distance = calculate_cluster_size(df['lat'], df['long'])
    print(distance)
    if distance > 250:
        # break clusters into more clusters until the maximum size of a cluster is less than 250 Miles
        max_distance = distance
        i = 2
        while max_distance > 250:
            kmeans = KMeans(n_clusters=i, random_state=0).fit(df[['lat', 'long']])
            df.loc[:, 'cl_temp'] = kmeans.labels_
            # largest sub-cluster span after this split
            max_distance = max(
                calculate_cluster_size(g['lat'], g['long'])
                for _, g in df.groupby('cl_temp')
            )
            i += 1
        load_geocodes.loc[df.index,'subcluster'] = kmeans.labels_
Greybeard answered 19/8, 2022 at 22:3