Calculating distance between *multiple* sets of geo coordinates in python

Asked 18/4, 2016 at 14:18 Answered 11/2 at 10:36

I am struggling to calculate the distance between multiple sets of latitude and longitude coordinates. In short, I have found numerous tutorials that use either math or geopy. These tutorials work great when I just want to find the distance between ONE set of coordinates (two unique locations). However, my objective is to scan a data set that has 400k combinations of origin and destination coordinates. One example of the code I have used is listed below, but I get errors when my arrays hold more than one record. Any helpful tips would be much appreciated. Thank you.

# starting dataframe is df

lat1 = df.lat1.as_matrix()
long1 = df.long1.as_matrix()
lat2 = df.lat2.as_matrix()
long2 = df.df_long2.as_matrix()

from geopy.distance import vincenty
point1 = (lat1, long1)
point2 = (lat2, long2)
print(vincenty(point1, point2).miles)

Pad answered 18/4, 2016 at 14:18 Comment(3)

Please confirm: you have a long list of co-ordinate pairs and you want to compute the distance between each pair? – Papert 18/4, 2016 at 14:24

How is your data being stored? Presumably there is some kind of loop surrounding this code? If not, how do you expect it to be repeated 400k times? – Culberson 18/4, 2016 at 14:27

You could use a KDTree algorithm, where you don't have to calculate all distances among the pairs. Perhaps this answer can give you some insight – Cytotaxonomy 19/4, 2016 at 1:0

Edit: here's a simple notebook example

A general approach, assuming that you have a DataFrame column containing points and you want to calculate distances between all of them. If you have separate columns, first combine them into (lat, lon) tuples and name the new column coords.

import pandas as pd
import numpy as np
from geopy.distance import vincenty


# assumes your DataFrame is named df, and its lon and lat columns are named lon and lat. Adjust as needed.
# list() is needed on Python 3, where zip() returns an iterator rather than a list
df['coords'] = list(zip(df.lat, df.lon))
# first, let's create a square DataFrame (think of it as a matrix if you like)
square = pd.DataFrame(
    np.zeros(len(df) ** 2).reshape(len(df), len(df)),
    index=df.index, columns=df.index)

This function looks up our 'end' coordinates from the df DataFrame using the name of the input column, then applies the geopy vincenty() function to every entry of df['coords'], with each entry as the first argument and 'end' as the second. This works because the function is applied column-wise from right to left.

def get_distance(col):
    end = df.ix[col.name]['coords']
    return df['coords'].apply(vincenty, args=(end,), ellipsoid='WGS-84')
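To see how the extra argument reaches the distance function: Series.apply calls the function once per element, passing the element first and then anything listed in args. The following is a minimal sketch with a stand-in function; fake_distance is hypothetical and only records its arguments, it is not a geopy call.

```python
import pandas as pd


def fake_distance(start, end):
    """Stand-in for a distance function; returns its arguments as a label."""
    return f"{start} -> {end}"


coords = pd.Series(["A", "B", "C"])
end = "B"

# Each element of the Series becomes the first argument, `end` the second.
labels = coords.apply(fake_distance, args=(end,))
result = labels.tolist()
```

Here result pairs every entry with the same end point, which is what fills one line of the square distance matrix.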

Now we're ready to calculate all the distances.
We're transposing the DataFrame (.T) because the loc[] method we'll be using to retrieve distances takes the index (row) label first and then the column label. However, our inner apply function (see above) populates a column with retrieved values.

distances = square.apply(get_distance, axis=1).T

Your geopy values are (IIRC) returned in kilometres, so you may need to convert these to whatever unit you want to use using .meters, .miles etc.

Something like the following should work:

def units(input_instance):
    return input_instance.meters

distances_meters = distances.applymap(units)

You can now index into your distance matrix using e.g. loc[row_index, column_index]. You should be able to adapt the above fairly easily. You might have to adjust the apply call in the get_distance function to ensure you're passing the correct values to vincenty. The pandas apply docs might be useful, in particular with regard to passing positional arguments using args (you'll need a recent pandas version for this to work).

This code hasn't been profiled, and there are probably much faster ways to do it, but it should be fairly quick for 400k distance calculations.
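If only the distance between each origin and its own destination is needed (one value per row, as in the question), a vectorized great-circle calculation avoids calling geopy once per row. The following is a minimal sketch, not geopy's method: it swaps the ellipsoidal calculation for the haversine formula on a sphere, so results will differ slightly from geopy's. The function name haversine_km and the mean Earth radius constant are assumptions chosen for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between paired points given in degrees.

    Inputs may be scalars or equal-length arrays; row i of the origin
    arrays is paired with row i of the destination arrays.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two origin/destination pairs, processed in a single call.
lat1 = np.array([0.0, -26.244333])
lon1 = np.array([0.0, -48.640946])
lat2 = np.array([0.0, -26.238000])
lon2 = np.array([1.0, -48.644670])
dist_km = haversine_km(lat1, lon1, lat2, lon2)
```

With a DataFrame, the four columns can be passed directly, for example haversine_km(df.lat1, df.long1, df.lat2, df.long2), assuming those column names.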

Oh and also

I can't remember whether geopy expects coordinates as (lon, lat) or (lat, lon). I bet it's the latter (sigh).
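For the record, geopy takes latitude first. Because a swapped pair often still looks like a plausible coordinate, a cheap range check can catch some of these mistakes before any distance is computed. This is only a sketch; check_lat_lon is a hypothetical helper, not part of geopy.

```python
def check_lat_lon(lat, lon):
    """Return (lat, lon) unchanged, or raise if the values are out of range.

    Latitude must lie in [-90, 90] and longitude in [-180, 180]. A swapped
    pair is only caught when the longitude value exceeds 90 in magnitude.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude {lat} out of range; are lat and lon swapped?")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude {lon} out of range")
    return (lat, lon)


point = check_lat_lon(40.7128, -74.0060)  # latitude first, then longitude

try:
    check_lat_lon(151.2093, -33.8688)  # same kind of pair with the columns swapped
    swapped_detected = False
except ValueError:
    swapped_detected = True
```

The check cannot detect a swap when both values fall within [-90, 90], so it complements rather than replaces a careful look at the column names.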

Update: here's a working script as of May 2021.

import numpy as np
import pandas as pd
import geopy.distance

# geopy DOES expect (lat, lon) order
df['latlon'] = list(zip(df['lat'], df['lon']))
square = pd.DataFrame(
    np.zeros((df.shape[0], df.shape[0])),
    index=df.index, columns=df.index
)

# replacing distance.vincenty with distance.distance
def get_distance(col):
    end = df.loc[col.name, 'latlon']
    return df['latlon'].apply(geopy.distance.distance,
                              args=(end,),
                              ellipsoid='WGS-84'
                             )

distances = square.apply(get_distance, axis=1).T

Portuguese answered 18/4, 2016 at 15:45 Comment(1)

The code needs some fixes due to changes to geopy and pandas. vincenty needs to be replaced by distance. And that syntax for allocating a value in pandas is now: df.loc[col.name, 'coords'] – Glinys 19/2, 2021 at 17:50

I recently had to do a similar job. I ended up writing a solution that I consider very easy to understand and tweak to your needs, though possibly not the best or fastest:

Solution

It is very similar to what urschrei posted: assuming you want the distance between every two consecutive coordinates from a Pandas DataFrame, we can write a function to process each pair of points as the start and finish of a path, compute the distance and then construct a new DataFrame to be the return:

import pandas as pd
from geopy import Point, distance
   
def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) -1
    distances = [None] * (traces)
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i+1][col_lat], coords.iloc[i+1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }

    return pd.DataFrame(distances)

Usage example

coords = pd.DataFrame({
    'lat': [-26.244333, -26.238000, -26.233880, -26.260000, -26.263730],
    'lon': [-48.640946, -48.644670, -48.648480, -48.669770, -48.660700],
})

print('-> coords DataFrame:\n', coords)
print('-'*79, end='\n\n')

distances = get_distances(coords)
distances['total distance'] = distances['path distance'].cumsum()
print('-> distances DataFrame:\n', distances)
print('-'*79, end='\n\n')

# Or if you want to use tuple for start/finish coordinates:
print('-> distances DataFrame using tuples:\n', get_distances(coords, point_obj=tuple))
print('-'*79, end='\n\n')

Output example

-> coords DataFrame:
          lat        lon
0 -26.244333 -48.640946
1 -26.238000 -48.644670
2 -26.233880 -48.648480
3 -26.260000 -48.669770
4 -26.263730 -48.660700
------------------------------------------------------------------------------- 

-> distances DataFrame:
                                   start                             finish  \
0  26 14m 39.5988s S, 48 38m 27.4056s W   26 14m 16.8s S, 48 38m 40.812s W   
1      26 14m 16.8s S, 48 38m 40.812s W  26 14m 1.968s S, 48 38m 54.528s W   
2     26 14m 1.968s S, 48 38m 54.528s W     26 15m 36s S, 48 40m 11.172s W   
3        26 15m 36s S, 48 40m 11.172s W  26 15m 49.428s S, 48 39m 38.52s W   

           path distance         total distance  
0  0.7941932910049856 km  0.7941932910049856 km  
1  0.5943709651000332 km  1.3885642561050187 km  
2  3.5914909016938505 km   4.980055157798869 km  
3  0.9958396130609087 km   5.975894770859778 km  
------------------------------------------------------------------------------- 

-> distances DataFrame using tuples:
                       start                  finish         path distance
0  (-26.244333, -48.640946)    (-26.238, -48.64467)  0.7941932910049856 km
1      (-26.238, -48.64467)  (-26.23388, -48.64848)  0.5943709651000332 km
2    (-26.23388, -48.64848)     (-26.26, -48.66977)  3.5914909016938505 km
3       (-26.26, -48.66977)   (-26.26373, -48.6607)  0.9958396130609087 km
-------------------------------------------------------------------------------

Mckeever answered 7/12, 2020 at 18:1 Comment(0)

As of 19th May

For anyone working with multiple geolocation records, you can adapt the above code, with a small modification, to work on a CSV file read from your drive. The code writes the output distances to a CSV file (geopy_output.csv here).

import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i+1][col_lat], coords.iloc[i+1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    output = pd.DataFrame(distances)
    output.to_csv('geopy_output.csv')
    return output

I used the same code and generated distance data for over 50,000 coordinates.

Alsoran answered 19/5, 2021 at 12:6 Comment(0)

This can be achieved with a map

import geopy.distance

df['dist_origin_dest'] = list(map(geopy.distance.geodesic, df.loc[:, ["lat1", "lon1"]].values, df.loc[:, ["lat2", "lon2"]].values))
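What makes this work is that map() with two iterables draws one item from each per call, so row i of the origins is paired with row i of the destinations. A minimal sketch of that pairing follows; describe_pair is a hypothetical stand-in for geodesic that only reports which rows met, and the coordinate values are made up.

```python
import numpy as np

origins = np.array([[10.0, 20.0], [30.0, 40.0]])       # rows of (lat1, lon1)
destinations = np.array([[11.0, 21.0], [31.0, 41.0]])  # rows of (lat2, lon2)


def describe_pair(origin, destination):
    """Stand-in for geodesic(): reports which rows were paired together."""
    return (tuple(origin.tolist()), tuple(destination.tolist()))


# map() takes one row from each array per call, so row i meets row i.
pairs = list(map(describe_pair, origins, destinations))
```

In the one-liner above, each result is a geopy distance object rather than a float, so a numeric column needs a unit attribute such as .km or .miles taken from each entry.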

Catafalque answered 11/2 at 10:36 Comment(1)

Please further explain how this answers the question and complements the upvoted ones such as this one https://mcmap.net/q/1082229/-calculating-distance-between-multiple-sets-of-geo-coordinates-in-python – Swellfish 13/2 at 15:38
