how do I cluster a list of geographic points by distance?

Asked 31/10, 2018 at 2:34 Answered 6/11, 2018 at 20:38

Solved python cluster-analysis latitude-longitude spatial-query

I have a list of points P=[p1,...pN] where pi=(latitudeI,longitudeI).

Using Python 3, I would like to find a smallest set of clusters (disjoint subsets of P) such that every member of a cluster is within 20km of every other member in the cluster.

Distance between two points is computed using the Vincenty method.

To make this a little more concrete, suppose I have a set of points such as

from numpy import *
points = array([[33.    , 41.    ],
       [33.9693, 41.3923],
       [33.6074, 41.277 ],
       [34.4823, 41.919 ],
       [34.3702, 41.1424],
       [34.3931, 41.078 ],
       [34.2377, 41.0576],
       [34.2395, 41.0211],
       [34.4443, 41.3499],
       [34.3812, 40.9793]])

Then I am trying to define this function:

from geopy.distance import vincenty
def clusters(points, distance):
    """Returns smallest list of clusters [C1,C2...Cn] such that
       for x,y in Ci, vincenty(x,y).km <= distance """
    return [points]  # Incorrect but gives the form of the output

NOTE: Many questions cluster on geo location and attribute. My question is for location only. This is for lat/lon, not Euclidean distance. There are other questions out there that give sort-of answers but not the answer to this question (many unanswered):

Westminster answered 31/10, 2018 at 2:34 Comment(4)

why not use the S2 library to create the 20km zones and see which points are in each? – Ritornello 3/11, 2018 at 0:40

tried k-means clustering with sklearn? it seams like treating it as a classification problem might help – Isodimorphism 6/11, 2018 at 15:10

I don't know K a priori. – Westminster 6/11, 2018 at 17:19

well no, you're optimizing for k, see below – Isodimorphism 6/11, 2018 at 21:59

This might be a start. the algorithm attempts to k means cluster the points by iterating k from 2 to the number of points validating each solution along the way. You should pick the lowest number.

It works by clustering the points and then checking that each cluster obeys the constraint. If any cluster is not compliant the solution is labeled as False and we move on to the next number of clusters.

Because the K-means algorithm used in sklearn falls into local minima, proving whether or not this is the solution you're looking for is the best one is still to be established, but it could be one

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import math

points = np.array([[33.    , 41.    ],
       [33.9693, 41.3923],
       [33.6074, 41.277 ],
       [34.4823, 41.919 ],
       [34.3702, 41.1424],
       [34.3931, 41.078 ],
       [34.2377, 41.0576],
       [34.2395, 41.0211],
       [34.4443, 41.3499],
       [34.3812, 40.9793]])


def distance(origin, destination): #found here https://gist.github.com/rochacbruno/2883505
    lat1, lon1 = origin[0],origin[1]
    lat2, lon2 = destination[0],destination[1]
    radius = 6371 # km
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d

def create_clusters(number_of_clusters,points):
    kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(points)
    l_array = np.array([[label] for label in kmeans.labels_])
    clusters = np.append(points,l_array,axis=1)
    return clusters

def validate_solution(max_dist,clusters):
    _, __, n_clust = clusters.max(axis=0)
    n_clust = int(n_clust)
    for i in range(n_clust):
        two_d_cluster=clusters[clusters[:,2] == i][:,np.array([True, True, False])]
        if not validate_cluster(max_dist,two_d_cluster):
            return False
        else:
            continue
    return True

def validate_cluster(max_dist,cluster):
    distances = cdist(cluster,cluster, lambda ori,des: int(round(distance(ori,des))))
    print(distances)
    print(30*'-')
    for item in distances.flatten():
        if item > max_dist:
            return False
    return True

if __name__ == '__main__':
    for i in range(2,len(points)):
        print(i)
        print(validate_solution(20,create_clusters(i,points)))

Once a benchmark established one would have to focus more one each cluster to establish whether its' points could be distributed to others without violating the distance constraint.

You can replace the lambda function in cdist with whatever distance metric you chose, I found the great circle distance in the repo i mentioned.

Isodimorphism answered 6/11, 2018 at 20:38 Comment(6)

This looks like an efficient solution. Someone asked me if this isn't an NP-hard problem in general, but it doesn't seem to be. Can you argue that the second pass to check points in one cluster that are also near enough to points in another cluster, that that 2nd pass doesn't make it NP hard? You would have to move pts from one cluster to another such that 2 clusters end up being combinable, is that NP-hard? – Westminster 7/11, 2018 at 16:28

well the second pass would be a series of pairwise comparisons, which according to the abstract of this paper: sciencedirect.com/science/article/pii/S0898122199000486 is O(n^2), so although impractical at scale, not impossible per se. Now if the sphere you're considering is the earth and the distance constraint is 20 rough math means only 1.623.699 disjoint zones could fit on the surface (not accounting for packing losses, just simple division of areas) so that puts a higher bound on the value of k even if len(points) goes higher, so no, i do not believe it is NP-Hard – Isodimorphism 7/11, 2018 at 17:26

I'll give you the 100 pts for giving it a good go. Nice observation about the # of zones. The practical origin of this problem is an ongoing programming challenge called Mercury (iarpa.gov/challenges/mercury.html). The main task in Mercury is to predict future events based on prior events in a region. When you predict you have to give a location and date. The prediction counts if the event is within say 4 days of the predicted date and within say 20KM of the actual location. So the natural thing to do is find the center of the 20KM-radius cluster of prior events. – Westminster 7/11, 2018 at 23:58

Thank you that was a very interesting problem to work on – Isodimorphism 8/11, 2018 at 0:2

Coming across this 3 years later - wanna add that I used a binary search to speed this up so you don't iterate through every value of K! – Eurypterid 29/3, 2021 at 10:11

How did you apply a binary search here? – Gargan 19/1, 2022 at 8:35

Here is a solution that seems correct and will behave O(N^2) worst case and better depending on the data:

def my_cluster(S,distance):
    coords=set(S)
    C=[]
    while len(coords):
        locus=coords.pop()
        cluster = [x for x in coords if vincenty(locus,x).km <= distance]
        C.append(cluster+[locus])
        for x in cluster:
            coords.remove(x)
    return C

NOTE: I am not marking this as an answer because one of my requirements is that it be a smallest set of clusters. My first pass is good but I haven't proven that it is a smallest set.

The result (on a larger set of points) can be visualized as follows:

Westminster answered 1/11, 2018 at 2:34 Comment(0)

Recommended topics

Hot tags