Distance betweeen coordinates Python vs R computation time

import numpy as np import pandas as pd import xarray as xr from geopy.distance import geodesic from timeit import default_timer as timer df = pd.DataFrame() city_coord_orig = (4.351749, 50.845701) city_coord_orig_r = tuple(reversed(city_coord_orig)) N = 100000 np.random.normal() df['or'] = [city_coord_orig_r] * N df['new'] = df.apply(lambda x: (x['or'][0] + np.random.normal(), x['or'][1] + np.random.normal()), axis=1) start = timer() df['d2city2'] = df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1) end = timer() print(end - start)

# clean up rm(list = ls()) # read libraries library(geosphere) city.coord.orig <- c(4.351749, 50.845701) N<-100000 many <- data.frame(x=rep(city.coord.orig[1], N) + rnorm(N), y=rep(city.coord.orig[2], N) + rnorm(N)) city.coord.orig <- c(4.351749, 50.845701) start_time <- Sys.time() many$d2city <- distGeo(city.coord.orig, many[,c("x","y")]) end_time <- Sys.time() end_time - start_time

You are using .apply(), which uses a simple loop to run your function for each and every row. The distance calculation is done entirely in Python (geopy uses geographiclib which appears to be written in Python only). Non-vectorised distance calculations are slow, what you need is a vectorised solution using compiled code, just like when calculating the Haversine distance.

pyproj offers verctorised WSG84 distance calculations (the methods of the pyproj.Geod class accept numpy arrays) and wraps the PROJ4 library, meaning it runs these calculations in native machine code:

from pyproj import Geod

# split out coordinates into separate columns
df[['or_lat', 'or_lon']] = pd.DataFrame(df['or'].tolist(), index=df.index)
df[['new_lat', 'new_lon']] = pd.DataFrame(df['new'].tolist(), index=df.index)

wsg84 = Geod(ellps='WGS84')
# numpy matrix of the lon / lat columns, iterable in column order
or_and_new = df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T
df['d2city2'] = wsg84.inv(*or_and_new)[-1] / 1000  # as km

This clocks in at considerably better times:

>>> from timeit import Timer
>>> count, total = Timer(
...     "wsg84.inv(*df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T)[-1] / 1000",
...     'from __main__ import wsg84, df'
... ).autorange()
>>> total / count * 10 ** 3  # milliseconds
66.09873340003105

66 milliseconds to calculate 100k distances, not bad!

To make the comparison objective, here is your geopy / df.apply() version on the same machine:

>>> count, total = Timer("df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)", 'from __main__ import geodesic, df').autorange()
>>> total / count * 10 ** 3  # milliseconds
25844.119450000107

25.8 seconds, not even in the same ballpark.

Recommended topics

Hot tags