Pandas Dataframe: join items in range based on their geo coordinates (longitude and latitude)
Asked Answered
W

2

9

I got a dataframe that contains places with their latitude and longitude. Imagine for example cities.

df = pd.DataFrame([{'city':"Berlin", 'lat':52.5243700, 'lng':13.4105300},
                   {'city':"Potsdam", 'lat':52.3988600, 'lng':13.0656600},
                   {'city':"Hamburg", 'lat':53.5753200, 'lng':10.0153400}]);

Now I'm trying to get all cities in a radius around another. Let's say all cities in a distance of 500km from Berlin, 500km from Hamburg and so on. I would do this by duplicating the original dataframe and joining both with a distance-function.

The intermediate result would be somewhat like this:

Berlin --> Potsdam
Berlin --> Hamburg
Potsdam --> Berlin
Potsdam --> Hamburg
Hamburg --> Potsdam
Hamburg --> Berlin

This final result after grouping (reducing) should be like this. Remark: Would be cool if the list of values includes all columns of the city.

Berlin --> [Potsdam, Hamburg]
Potsdam --> [Berlin, Hamburg]
Hamburg --> [Berlin, Potsdam]

Or just the count of cities 500km around one city.

Berlin --> 2
Potsdam --> 2
Hamburg --> 2

Since I'm quite new to Python, I would appreciate any starting point. I'm familiar with haversine distance. But not sure if there are useful distance/spatial methods in Scipy or Pandas.

Glad if you can give me a starting point. Up to now I tried following this post.

Update: The original idea behind this question comes from the Two Sigma Connect Rental Listing Kaggle Competition. The idea is to get those listing 100m around another listing. Which a) indicates a density and therefore a popular area and b) if the addresses are compares, you can find out if there is a crossing and therefore a noisy area. Therefore you not need the full item to item relation since you need to compare not only the distance but also the address and other meta-data. PS: I'm not uploading a solution to Kaggle. I just want to learn.

Whitebook answered 18/3, 2017 at 17:49 Comment(2)
Haversine doesn't really belong in scipy or pandas. Define a function in the code. This one seems like it should work: #4913849Wassail
"I'm familiar with haversine distance" -- no such animal; haversine(x) is a trig function like sine(x), cosine(x) etc. There are several formulas for calculation of a great-circle distance between 2 points; three of these formulas use the haversine(x) function explicitly or implicitly.Rikki
A
12

You can use:

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):

    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

First need cross join with merge, remove row with same values in city_x and city_y by boolean indexing:

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566

Then apply haversine function:

df['dist'] = df.apply(lambda row: haversine(row['lng_x'], 
                                            row['lat_x'], 
                                            row['lng_y'], 
                                            row['lat_y']), axis=1)

Filter distance:

df = df[df.dist < 500]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.215704
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.223782
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.215704
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.464120
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.223782
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.464120

And last create list or get size with groupby:

df1 = df.groupby('city_x')['city_y'].apply(list)
print (df1)
city_x
Berlin     [Potsdam, Hamburg]
Hamburg     [Berlin, Potsdam]
Potsdam     [Berlin, Hamburg]
Name: city_y, dtype: object

df2 = df.groupby('city_x')['city_y'].size()
print (df2)
city_x
Berlin     2
Hamburg    2
Potsdam    2
dtype: int64

Also is possible use numpy haversine solution:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
#print (df)

df['dist'] = haversine_np(df['lng_x'],df['lat_x'],df['lng_y'],df['lat_y'])
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.198616
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.063541
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.198616
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.311890
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.063541
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.311890
Avirulent answered 18/3, 2017 at 18:7 Comment(7)
Nice starting point. But imagine the dataframe contains millions of locations. Is there a way to apply the 500km-haversine during the join? So that not each item is joined with every other item but only with those who are in the desired range.Whitebook
I think it is problem, because dont know distance befory apply function haversine. :(Avirulent
where did you get the radians variable from? inside your haversine function? ah, found it in math.radiansWhitebook
it is math function, need from math import radians, cos, sin, asin, sqrt Avirulent
I realized that there is a conditional join in PySpark but unfortunately not in Pandas. I will try that on Monday as soon as I have access to a Spark cluster. Let you know then!Whitebook
I ended up using the scipy.spatial.KDTree into which I added all longitude/latitude converted into 3D carthesian coordinates. This was then a lookup-index to determine those points of interest in a close distance to the current one. And it avoids a full join on a dataset of 49.000 rows. But your answer was correct according to my question!Whitebook
I ran both options haversine vs haversine_np on my df of around 6k rows and found that haversine_np was 5x faster, ~41 ms.Pigeon
B
2

UPDATE: I would suggest first to buiuld a distance DataFrame:

from scipy.spatial.distance import squareform, pdist
from itertools import combinations

# see definition of "haversine_np()" below     
x = pd.DataFrame({'dist':pdist(df[['lat','lng']], haversine_np)},
                 index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))

efficiently produces pairwise distance DF (without duplicates):

In [106]: x
Out[106]:
                       dist
Berlin  Potsdam   27.198616
        Hamburg  255.063541
Potsdam Hamburg  242.311890

Old answer:

Here is a bit optimized version, which uses scipy.spatial.distance.pdist method:

from scipy.spatial.distance import squareform, pdist

# slightly modified version: of https://mcmap.net/q/46814/-fast-haversine-approximation-python-pandas
def haversine_np(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1],
                                         p2[0], p2[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

x = pd.DataFrame(squareform(pdist(df[['lat','lng']], haversine_np)),
                 columns=df.city.unique(),
                 index=df.city.unique())

this gives us:

In [78]: x
Out[78]:
             Berlin     Potsdam     Hamburg
Berlin     0.000000   27.198616  255.063541
Potsdam   27.198616    0.000000  242.311890
Hamburg  255.063541  242.311890    0.000000

let's count number of cities where the distance is greater than 30:

In [81]: x.groupby(level=0, as_index=False) \
    ...:  .apply(lambda c: c[c>30].notnull().sum(1)) \
    ...:  .reset_index(level=0, drop=True)
Out[81]:
Berlin     1
Hamburg    2
Potsdam    1
dtype: int64
Bugaboo answered 18/3, 2017 at 23:27 Comment(3)
But if the original data contains millions of lines then this wouldn't work since the number of columns would be exceeded, right?Whitebook
@Matthias, well if you have millions of unique cities, then it might be slow... But pdist - is optimal in terms of calculating pairwise distance. I'd suggest you to try this approach...Bugaboo
Very nice solution, but does not seem to work on the original data. I updated the post above, so that you could try it on the data from the Kaggle competition. It contains rental listings based on longitude and latitude and the combinations would be on the df.index.Whitebook

© 2022 - 2024 — McMap. All rights reserved.