How to use Vectorization with NumPy arrays to calculate geodesic distance using Geopy library for a large dataset?

4

10

I am trying to calculate geodesic distance from a dataframe which consists of four columns of latitude and longitude data with around 3 million rows. I used the apply lambda method to do it but it took 18 minutes to finish the task. Is there a way to use Vectorization with NumPy arrays to speed up the calculation? Thank you for answering.

My code using apply and lambda method:

from geopy import distance

df['geo_dist'] = df.apply(lambda x: distance.distance(
                              (x['start_latitude'], x['start_longitude']),
                              (x['end_latitude'], x['end_longitude'])).miles, axis=1)

Updates:

I tried the code below, which passes whole arrays to geopy, but it raises the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). I would appreciate any help.

df['geo_dist'] = distance.distance(
                          (df['start_latitude'].values, df['start_longitude'].values),
                          (df['end_latitude'].values, df['end_longitude'].values)).miles
Tate answered 10/5, 2018 at 14:17 Comment(1)
Geopy distance routines don't support vectorization currently. Perhaps @cffk, the author of geodesic routines, might suggest a solution there? This issue is being tracked in github.com/geopy/geopy/issues/189 - Cannot
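Because geopy's routines take one pair of points at a time, one way around the limitation is to evaluate an ellipsoidal formula directly on NumPy arrays. The following is a minimal sketch, assuming the WGS-84 ellipsoid and using Vincenty's inverse formula; it is not geopy's own implementation, and the name vincenty_miles is a hypothetical helper introduced only for illustration. It runs a fixed number of iterations, so nearly antipodal pairs may not converge accurately.

```python
import numpy as np

# WGS-84 ellipsoid parameters (assumed here; not taken from the question).
A_AXIS = 6378137.0                   # semi-major axis in metres
FLATTENING = 1 / 298.257223563
B_AXIS = (1 - FLATTENING) * A_AXIS   # semi-minor axis in metres
METERS_PER_MILE = 1609.344


def vincenty_miles(lat1, lon1, lat2, lon2, n_iter=20):
    """Vectorized Vincenty inverse formula; inputs in degrees, output in miles.

    Hypothetical helper, not part of geopy.
    """
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    f = FLATTENING
    # Reduced latitudes on the auxiliary sphere.
    u1 = np.arctan((1 - f) * np.tan(lat1))
    u2 = np.arctan((1 - f) * np.tan(lat2))
    sin_u1, cos_u1 = np.sin(u1), np.cos(u1)
    sin_u2, cos_u2 = np.sin(u2), np.cos(u2)

    big_l = lon2 - lon1
    lam = big_l
    for _ in range(n_iter):
        sin_lam, cos_lam = np.sin(lam), np.cos(lam)
        sin_sigma = np.hypot(cos_u2 * sin_lam,
                             cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = np.arctan2(sin_sigma, cos_sigma)
        # Guard the divisions: coincident points give sin_sigma == 0 and
        # equatorial lines give cos_sq_alpha == 0.
        safe_sin_sigma = np.where(sin_sigma == 0, 1.0, sin_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / safe_sin_sigma
        cos_sq_alpha = 1 - sin_alpha ** 2
        safe_cos_sq = np.where(cos_sq_alpha == 0, 1.0, cos_sq_alpha)
        cos_2sm = np.where(cos_sq_alpha == 0, 0.0,
                           cos_sigma - 2 * sin_u1 * sin_u2 / safe_cos_sq)
        c = f / 16 * cos_sq_alpha * (4 + f * (4 - 3 * cos_sq_alpha))
        lam = big_l + (1 - c) * f * sin_alpha * (
            sigma + c * sin_sigma * (
                cos_2sm + c * cos_sigma * (2 * cos_2sm ** 2 - 1)))

    u_sq = cos_sq_alpha * (A_AXIS ** 2 - B_AXIS ** 2) / B_AXIS ** 2
    big_a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    big_b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = big_b * sin_sigma * (
        cos_2sm + big_b / 4 * (
            cos_sigma * (2 * cos_2sm ** 2 - 1)
            - big_b / 6 * cos_2sm
            * (4 * sin_sigma ** 2 - 3) * (4 * cos_2sm ** 2 - 3)))
    meters = B_AXIS * big_a * (sigma - delta_sigma)
    return meters / METERS_PER_MILE
```

With the column names from the question, this could be called as df['geo_dist'] = vincenty_miles(df['start_latitude'], df['start_longitude'], df['end_latitude'], df['end_longitude']). It is worth checking a handful of rows against geopy before relying on it for the full dataset.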
4

I think you might consider using geopandas for this; it's an extension of pandas (and therefore numpy) designed to do these types of calculations very quickly.

Specifically, it has a method for calculating the distance between sets of points in a GeoSeries, which can be a column of a GeoDataFrame. I’m fairly certain that this method leverages numexpr for vectorization.

It should look something like this: you convert your data frame to a GeoDataFrame with (at least) two GeoSeries columns, one for the origin points and one for the destination points. The distance call should return a Series with one distance per row:

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df, crs={'init': 'epsg:4326'}, geometry=geometry)

distances = gdf.geometry.distance(gdf.destination_geometry)
Leukoderma answered 11/5, 2018 at 4:50 Comment(2)
Where does gdf.destination_geometry come from? I got it to work by creating a dest_geometry using a different set of longitudes and latitudes, creating a dest_gdf, then calling gdf.geometry.distance(dest_gdf.geometry). Also, that gpd.GeoDataFrame line gives a FutureWarning on the authority code. As best I can tell the line should be: gdf = gpd.GeoDataFrame(df, crs='epsg:4326', geometry=geometry) - Impeachment
The geopandas distance calculation makes use of GEOS to calculate distance. The GEOS calculations are all linear. Thus, this solution does not provide a geodesic distance. - Flint
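To illustrate the point about linear calculations, here is a small NumPy-only sketch (it does not call geopandas). It assumes a spherical Earth of radius 6371 km purely for illustration and compares a planar distance taken directly on longitude/latitude degrees with the approximate ground length of the same one-degree step at two latitudes.

```python
import numpy as np

# Two point pairs, each exactly one degree of longitude apart (lon, lat in degrees).
start = np.array([[0.0, 0.0], [0.0, 60.0]])
end = np.array([[1.0, 0.0], [1.0, 60.0]])

# Planar (Euclidean) distance computed directly on the degree values,
# which is what a linear distance on EPSG:4326 coordinates amounts to.
planar_deg = np.hypot(*(end - start).T)

# Rough ground length of each step along its parallel, assuming a sphere of
# radius 6371 km (an approximation chosen only for this illustration).
lat = np.radians(start[:, 1])
ground_km = np.radians(end[:, 0] - start[:, 0]) * 6371.0 * np.cos(lat)

print(planar_deg)  # identical planar values, in degrees
print(ground_km)   # the step at 60 degrees latitude is about half as long
```

The planar result is in degrees rather than miles, and it treats both steps as equal even though they cover different distances on the ground.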
3

The answer to your question: You cannot do what you want to do with geopy. I am not familiar with this package but the error traceback shows that this function and possibly all other functions in this package were not written/designed with vectorized computations in mind.

Now, if you can make do with great-circle distances, then I would suggest that you experiment with the astropy.coordinates package, which may be able to compute separations between points in a vectorized way.

Here is an example based on my answer to a different question: Finding closest point:

from astropy.units import Quantity
from astropy.coordinates import SkyCoord, EarthLocation
from astropy.constants import R_earth
import numpy as np

lon1 = Quantity([-71.312796, -87.645307, -87.640426, -87.635513,
                 -87.630629, -87.625793 ], unit='deg')
lat1 = Quantity([41.49008, 41.894577, 41.894647, 41.894713,
                 41.894768, 41.894830], unit='deg')
lon2 = Quantity([-81.695391, -87.645307 + 0.5, -87.640426, -87.635513 - 0.5,
                 -87.630629 + 1.0, -87.625793 - 1.0], unit='deg')
lat2 = Quantity([41.499498, 41.894577 - 0.5, 41.894647, 41.894713 - 0.5,
                 41.894768 - 1.0, 41.894830 + 1.0], unit='deg')

pts1 = SkyCoord(EarthLocation.from_geodetic(lon1, lat1, height=R_earth).itrs, frame='itrs')
pts2 = SkyCoord(EarthLocation.from_geodetic(lon2, lat2, height=R_earth).itrs, frame='itrs')

Then, distances between the two sets of points can be computed as:

>>> dist = pts2.separation(pts1)
>>> print(dist)
<Angle [ 7.78350849, 0.62435354, 0., 0.62435308, 1.25039805, 1.24353876] deg>

Approximate conversion to distance:

>>> pts2.separation(pts1).radian * R_earth
<Quantity [ 866451.17527216,  69502.31527953,      0.        ,
             69502.26348614, 139192.86680148, 138429.29874024] m>

Compare the first value with what you would get from the geopy's example:

>>> distance.distance((41.49008, -71.312796), (41.499498, -81.695391)).meters
866455.4329098687

EDIT: This may in fact give you the geodesic distance that you are after, but make sure to check the description of EarthLocation.

Boreal answered 11/5, 2018 at 3:29 Comment(0)
0

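Looping over the underlying NumPy arrays instead of using apply (here for the distance from each row to the next one):

import numpy as np
from geopy import distance

lats = df['latitude'].values
lons = df['longitude'].values
# Shift by -1 so that each row is paired with the row that follows it.
lats_next = np.roll(lats, -1)
lons_next = np.roll(lons, -1)
# geopy handles one pair at a time, so iterate over the plain arrays.
dists = np.array([distance.distance((lat0, lon0), (lat1, lon1)).kilometers
                  for lat0, lon0, lat1, lon1 in zip(lats, lons, lats_next, lons_next)])
dists[-1] = np.nan  # the last row has no following point
df['distance'] = dists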
Anteater answered 19/8, 2020 at 8:10 Comment(0)
0

See a similar question here with a much faster alternative.

import pandas as pd
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    # Convert all four inputs from degrees to radians in one call.
    lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    haver_formula = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2

    r = 3958.756  # mean Earth radius in miles; use 6371 for kilometres
    dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
    return dist  # NumPy array, so assignment to the DataFrame is positional


# Pass longitude first, then latitude, to match the function signature.
df['dist'] = haversine(df['start_longitude'], df['start_latitude'], df['end_longitude'], df['end_latitude'])
Indiction answered 15/10, 2022 at 3:32 Comment(1)
The haversine equation is only intended for short distances and when precision isn't too important as it assumes a spherical Earth. - Flint
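To get a feel for the size of that spherical-Earth error, the sketch below compares a haversine result with the geopy value quoted earlier in this thread for the same pair of points (866455.4329098687 m). The helper name haversine_m and the mean radius of 6371000 m are assumptions made for this illustration.

```python
import numpy as np


def haversine_m(lon1, lat1, lon2, lat2, r=6371000.0):
    """Great-circle distance in metres on a sphere of radius r (assumed mean radius)."""
    lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(h))


# Reference value: the geopy result quoted earlier in this thread.
geodesic_m = 866455.4329098687
sphere_m = haversine_m(-71.312796, 41.49008, -81.695391, 41.499498)
rel_error = abs(sphere_m - geodesic_m) / geodesic_m
print(f"haversine: {sphere_m:.0f} m, relative difference: {rel_error:.3%}")
```

Whether a difference of this size matters depends on the application; the result also changes with the radius chosen for the sphere.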
