Fast Haversine Approximation (Python/Pandas)
Asked Answered
R

7

59

Each row in a Pandas dataframe contains lat/lng coordinates of 2 points. Using the Python code below, calculating the distances between these 2 points for many (millions) of rows takes a very long time!

Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km


for index, row in df.iterrows():
    df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])
Repatriate answered 9/4, 2015 at 18:0 Comment(14)
A better approach than approximation would be to profile the function to get a sense of exactly why it takes too long, followed by using ctypes/Cython/numba to translate the function as-is into a C function that runs without as much overhead. You may need to modify your calling convention to use the numpy array values of data underlying each pandas Series column of data, and you can also checkout numpy.ctypeslib for easy conversion from a numpy array to a ctypes-compatible array. It seems like a lot, but really it's a pretty easy way to access C functions in Python.Hermineherminia
It may be possible to avoid doing the calculation for a majority of candidates. Calculate the min and max longitutes and latitudes 50 miles from your starting point. Then use those those mins and maxes to weed out most of the candidates.Lioness
You could also consider constructing a k-d tree out of the data, rather than storing it in a relational structure like a DataFrame. Then it would be cheap to get neighbors of a given point, and perhaps you could only calculate distances on demand. Does the application always need every pair? Yet another option could be to cluster the points and use each cluster's centroid/mean as a proxy. Then the distance between any two points would be approximated by the distance between only the cluster centers. It's speculative whether anything fancy like this is really better than brute force though.Hermineherminia
In the clustering case, you might also want to calculate true distances for all pairs in the same cluster (for each cluster) so that those points don't have effective distance of zero. This is also amenable to simple parallelization: give each process a copy of the data (or have it load a copy) along with a list of indices that it is responsible for. Then that process computes pairwise distance of those indices against all other indices, and writes them somewhere. With only millions of rows, this should be reasonable on modest hardware.Hermineherminia
There is a haversine module on pypi that is implemented in C.Lioness
Wow the suggestions in the comments are beyond me! I was wondering whether I can do just sqrt((lat2 - lat1)^^2 + (lon2 - lon1)^^2) will that even work? All the points lie within new york city.Repatriate
@Repatriate The function you provided in your question gives the great circle distance. The calculation in your comment gives the euclidean distance. Because the radius of the earth is so large, you can absolutely approximate with the euclidean version for small distances.Cowled
@ballsdotballs Seems like the euclidean distance will work for me! How should the result of sqrt((lat2 - lat1)^^2 + (lon2 - lon1)^^2) be converted into miles?Repatriate
Yeah the euclidian approximation will work fine for small enough distances. You shouldn't even need to do an apply for that, can directly just use the columns in the dataframe.Castera
Are your lat1, long1, lat1, long1 that are calculated all on the same row?Raulrausch
check this answer: #54947119Principate
I'm confused why people hinted that sqrt((lat2 - lat1)^^2 + (lon2 - lon1)^^2) would work half decently? You'd still need a factor to turn into a distance (111 km ≈ 1° of lat)... but unless you're near the equator, lat and long are significantly different in size. So for example at 45N, an error of 30% on longitude parts of the calculation (maybe very rough is fine, but it won't affect all answers the same. 111*sqrt((lat2 - lat1)^^2 + (lon2 - lon1)*cos(π*lon1/180)^^2) would seem a better rough estimate? It's an old Q, but popular, so helpfully helpful to avoid that unless needing REALLY roughSartorial
Iterrows is hundreds to thousands of times slower than pure vectorization. I have worked through 13 ways to process a Pandas dataframe and will be presenting it soon hopefully, plotting the speed of each.Intercept
Done: see my answer: How to iterate over Pandas DataFrames without iterating. I present a speed plot of 13 techniques. Vectorization is up to 1400x faster.Intercept
C
147

Here is a vectorized numpy version of the same function:

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    
    All args must be of equal length.    
    
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6378.137 * c
    return km

The inputs are all arrays of values, and it should be able to do millions of points instantly. The requirement is that the inputs are ndarrays but the columns of your pandas table will work.

For example, with randomly generated values:

>>> import numpy as np
>>> import pandas
>>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000)
>>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
>>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Or if you want to create another column:

>>> df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Looping through arrays of data is very slow in python. Numpy provides functions that operate on entire arrays of data, which lets you avoid looping and drastically improve performance.

This is an example of vectorization.

Cowled answered 9/4, 2015 at 19:3 Comment(11)
Good to know about that term array programming, didn't come across it with MATLAB.Mayhap
Thank you very much for this. Small suggestion: add a real-world example usage with actual coordinates, instead of random values, to clarify the input format.Myelitis
Note that this also works when one pair of arguments is a Series and another is a tuple: haversine_np(pd.Series([-74.00594, -122.41942]), pd.Series([40.71278, 37.77493]), -87.65005, 41.85003) calculates distance between (New York, San Francisco) and Chicago.Myelitis
Another minor suggestion: you might want to exchange the order of function arguments to lat, lon. In many sources latitude goes first, e.g. in en.wikipedia.org/wiki/Horizontal_position_representation.Myelitis
u know it actually works even with different sized arrays ;) through dimension projectionEphrayim
@DennisGolomazov afaik GIS libraries and software usually order it as lon, lat. Latitude first seems to be mainly a Google Maps choice. I'd personally find it odd as lat, lon.Frae
How would I tie the resulting array back to the original df?Immortalize
How can it be applied to different sized arrays, i.e. to find the distance between each point in the first array and every point in the second array?Freedom
I made a feature request to sklearn to add your code: github.com/scikit-learn/scikit-learn/issues/17212Harlen
It should also be noted that numpy implementation in this specific case will be 12 times slower than implementation using from math import radians, cos, sin, asin, sqrt and doing it a bit more manually. numpy benefits will show face when you apply it to a whole dataframe at once.Khelat
I just added a Python/Pandas answer and plots where I show 13 techniques to iterate over a Pandas DataFrame, and show that manually iterating can be up to 1400x slower than pure vectorization. Great answer showing the Haversine technique, by the way. I referenced it recently.Intercept
H
20

Purely for the sake of an illustrative example, I took the numpy version in the answer from @ballsdotballs and also made a companion C implementation to be called via ctypes. Since numpy is such a highly optimized tool, there is little chance that my C code will be as efficient, but it should be somewhat close. The big advantage here is that by running through an example with C types, it can help you see how you can connect up your own personal C functions to Python without too much overhead. This is especially nice when you just want to optimize a small piece of a bigger computation by writing that small piece in some C source rather than Python. Simply using numpy will solve the problem most of the time, but for those cases when you don't really need all of numpy and you don't want to add the coupling to require use of numpy data types throughout some code, it's very handy to know how to drop down to the built-in ctypes library and do it yourself.

First let's create our C source file, called haversine.c:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int haversine(size_t n, 
              double *lon1, 
              double *lat1, 
              double *lon2, 
              double *lat2,
              double *kms){

    if (   lon1 == NULL 
        || lon2 == NULL 
        || lat1 == NULL 
        || lat2 == NULL
        || kms == NULL){
        return -1;
    }

    double km, dlon, dlat;
    double iter_lon1, iter_lon2, iter_lat1, iter_lat2;

    double km_conversion = 2.0 * 6367.0; 
    double degrees2radians = 3.14159/180.0;

    int i;
    for(i=0; i < n; i++){
        iter_lon1 = lon1[i] * degrees2radians;
        iter_lat1 = lat1[i] * degrees2radians;
        iter_lon2 = lon2[i] * degrees2radians;
        iter_lat2 = lat2[i] * degrees2radians;

        dlon = iter_lon2 - iter_lon1;
        dlat = iter_lat2 - iter_lat1;

        km = pow(sin(dlat/2.0), 2.0) 
           + cos(iter_lat1) * cos(iter_lat2) * pow(sin(dlon/2.0), 2.0);

        kms[i] = km_conversion * asin(sqrt(km));
    }

    return 0;
}

// main function for testing
int main(void) {
    double lat1[2] = {16.8, 27.4};
    double lon1[2] = {8.44, 1.23};
    double lat2[2] = {33.5, 20.07};
    double lon2[2] = {14.88, 3.05};
    double kms[2]  = {0.0, 0.0};
    size_t arr_size = 2;

    int res;
    res = haversine(arr_size, lon1, lat1, lon2, lat2, kms);
    printf("%d\n", res);

    int i;
    for (i=0; i < arr_size; i++){
        printf("%3.3f, ", kms[i]);
    }
    printf("\n");
}

Note that we're trying to keep with C conventions. Explicitly passing data arguments by reference, using size_t for a size variable, and expecting our haversine function to work by mutating one of the passed inputs such that it will contain the expected data on exit. The function actually returns an integer, which is a success/failure flag that could be used by other C-level consumers of the function.

We're going to need to find a way to handle all of these little C-specific issues inside of Python.

Next let's put our numpy version of the function along with some imports and some test data into a file called haversine.py:

import time
import ctypes
import numpy as np
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (np.sin(dlat/2)**2 
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

if __name__ == "__main__":
    lat1 = 50.0 * np.random.rand(1000000)
    lon1 = 50.0 * np.random.rand(1000000)
    lat2 = 50.0 * np.random.rand(1000000)
    lon2 = 50.0 * np.random.rand(1000000)

    t0 = time.time()
    r1 = haversine(lon1, lat1, lon2, lat2)
    t1 = time.time()
    print t1-t0, r1

I chose to make lats and lons (in degrees) that are randomly chosen between 0 and 50, but it doesn't matter too much for this explanation.

The next thing we need to do is to compile our C module in such a way that it can be dynamically loaded by Python. I'm using a Linux system (you can find examples for other systems very easily on Google), so my goal is to compile haversine.c into a shared object, like so:

gcc -shared -o haversine.so -fPIC haversine.c -lm

We can also compile to an executable and run it to see what the C program's main function displays:

> gcc haversine.c -o haversine -lm
> ./haversine
0
1964.322, 835.278, 

Now that we have compiled the shared object haversine.so, we can use ctypes to load it in Python and we need to supply the path to the file to do so:

lib_path = "/path/to/haversine.so" # Obviously use your real path here.
haversine_lib = ctypes.CDLL(lib_path)

Now haversine_lib.haversine acts pretty much just like a Python function, except that we might need to do some manual type marshaling to make sure the inputs and outputs are interpreted correctly.

numpy actually provides some nice tools for this and the one I'll use here is numpy.ctypeslib. We're going to build a pointer type that will allow us to pass around numpy.ndarrays to these ctypes-loaded functions as through they were pointers. Here's the code:

arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, 
                                       ndim=1, 
                                       flags='CONTIGUOUS')

haversine_lib.haversine.restype = ctypes.c_int
haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                    arr_1d_double, 
                                    arr_1d_double,
                                    arr_1d_double,
                                    arr_1d_double,
                                    arr_1d_double] 

Notice that we tell the haversine_lib.haversine function proxy to interpret its arguments according to the types we want.

Now, to test it out from Python what remains is to just make a size variable, and an array that will be mutated (just like in the C code) to contain the result data, then we can call it:

size = len(lat1)
output = np.empty(size, dtype=np.double)
print "====="
print output
t2 = time.time()
res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
t3 = time.time()
print t3 - t2, res
print type(output), output

Putting it all together in the __main__ block of haversine.py, the whole file now looks like this:

import time
import ctypes
import numpy as np
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (np.sin(dlat/2)**2 
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

if __name__ == "__main__":
    lat1 = 50.0 * np.random.rand(1000000)
    lon1 = 50.0 * np.random.rand(1000000)
    lat2 = 50.0 * np.random.rand(1000000)
    lon2 = 50.0 * np.random.rand(1000000)

    t0 = time.time()
    r1 = haversine(lon1, lat1, lon2, lat2)
    t1 = time.time()
    print t1-t0, r1

    lib_path = "/home/ely/programming/python/numpy_ctypes/haversine.so"
    haversine_lib = ctypes.CDLL(lib_path)
    arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, 
                                           ndim=1, 
                                           flags='CONTIGUOUS')

    haversine_lib.haversine.restype = ctypes.c_int
    haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                        arr_1d_double, 
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double]

    size = len(lat1)
    output = np.empty(size, dtype=np.double)
    print "====="
    print output
    t2 = time.time()
    res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
    t3 = time.time()
    print t3 - t2, res
    print type(output), output

To run it, which will run and time the Python and ctypes versions separately and print some results, we can just do

python haversine.py

which displays:

0.111340045929 [  231.53695005  3042.84915093   169.5158946  ...,  1359.2656769
  2686.87895954  3728.54788207]
=====
[  6.92017600e-310   2.97780954e-316   2.97780954e-316 ...,
   3.20676686e-001   1.31978329e-001   5.15819721e-001]
0.148446083069 0
<type 'numpy.ndarray'> [  231.53675618  3042.84723579   169.51575588 ...,  1359.26453029
  2686.87709456  3728.54493339]

As expected, the numpy version is slightly faster (0.11 seconds for vectors with length of 1 million) but our quick and dirty ctypes version is no slouch: a respectable 0.148 seconds on the same data.

Let's compare this to a naive for-loop solution in Python:

from math import radians, cos, sin, asin, sqrt

def slow_haversine(lon1, lat1, lon2, lat2):
    n = len(lon1)
    kms = np.empty(n, dtype=np.double)
    for i in range(n):
       lon1_v, lat1_v, lon2_v, lat2_v = map(
           radians, 
           [lon1[i], lat1[i], lon2[i], lat2[i]]
       )

       dlon = lon2_v - lon1_v 
       dlat = lat2_v - lat1_v 
       a = (sin(dlat/2)**2 
            + cos(lat1_v) * cos(lat2_v) * sin(dlon/2)**2)
       c = 2 * asin(sqrt(a)) 
       kms[i] = 6367 * c
    return kms

When I put this into the same Python file as the others and time it on the same million-element data, I consistently see a time of around 2.65 seconds on my machine.

So by quickly switching to ctypes we improve the speed by a factor of about 18. For many calculations that can benefit from access to bare, contiguous data, you often see gains much higher even than this.

Just to be super clear, I am not at all endorsing this as a better option than just using numpy. This is precisely the problem that numpy was built to solve, and so homebrewing your own ctypes code whenever it both (a) makes sense to incorporate numpy data types in your application and (b) there exists an easy way to map your code into a numpy equivalent, is not very efficient.

But it's still very helpful to know how to do this for those occasions when you prefer to write something in C yet call it in Python, or situations where a dependence on numpy is not practical (in an embedded system where numpy cannot be installed, for example).

Hermineherminia answered 9/4, 2015 at 20:28 Comment(1)
This is awesome!Traceytrachea
D
15

In case that using scikit-learn is allowed, I would give the following a chance:

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

# example data
lat1, lon1 = 36.4256345, -5.1510261
lat2, lon2 = 40.4165, -3.7026
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

X = [[lat1, lon1],
     [lat2, lon2]]
kms = 6367
print(kms * dist.pairwise(X))
Documentation answered 25/1, 2017 at 17:53 Comment(1)
Just watch out that the arguments order shall be lat, lon unlike in many GIS librariesRuffled
T
9

A trivial extension to @derricw's vectorised solution, you can use numba to improve performance by ~2x with virtually no change to your code. For pure numerical calculations, this should probably be used for benchmarking / testing versus possibly more efficient solutions.

from numba import njit

@njit
def haversine_nb(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians(lon1), np.radians(lat1), np.radians(lon2), np.radians(lat2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

Benchmarking versus the Pandas function:

%timeit haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
# 1 loop, best of 3: 1.81 s per loop

%timeit haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)
# 1 loop, best of 3: 921 ms per loop

Full benchmarking code:

import pandas as pd, numpy as np
from numba import njit

def haversine_pd(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

@njit
def haversine_nb(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians(lon1), np.radians(lat1), np.radians(lon2), np.radians(lat2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

np.random.seed(0)
lon1, lon2, lat1, lat2 = np.random.randn(4, 10**7)
df = pd.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
km = haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
km_nb = haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)

assert np.isclose(km.values, km_nb).all()

%timeit haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
# 1 loop, best of 3: 1.81 s per loop

%timeit haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)
# 1 loop, best of 3: 921 ms per loop
Trivial answered 24/10, 2018 at 10:36 Comment(0)
P
3

The vectorized function specifies that "All args must be of equal length". By extending the bounds of the "larger" dataset, according to this, one can efficiently find the distance all i,j pairs of elements.

from random import uniform
import numpy as np

def new_haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1[:,None]

    dlat = lat2 - lat1[:,None]

    a = np.sin(dlat/2.0)**2 + np.cos(lat1[:,None]) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

lon1 = [uniform(-180,180) for n in range(6)]
lat1 = [uniform(-90, 90) for n in range(6)]
lon2 = [uniform(-180,180) for n in range(4)]
lat2 = [uniform(-90, 90) for n in range(4)]

new = new_haversine_np(lon1, lat1, lon2, lat2)

for i in range(6):
    for j in range(4):
        print(i,j,round(new[i,j],2))
Pomeroy answered 18/6, 2019 at 3:32 Comment(0)
S
2

Scikit-learn library also has another function for calculating the haversine distances called the haversine_distances function, which can be used to find the distances between two co-ordinate, see the example below:

from sklearn.metrics.pairwise import haversine_distances
import numpy as np

lat1, lon1 = [-34.83333, -58.5166646]
lat2, lon2 = [49.0083899664, 2.53844117956]
lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

result = haversine_distances([[lat1, lon1], [lat2, lon2]])[0][1]  # this formula returns a 2x2 array, this is the reason we used indexing.
print(result * 6373)  # multiply by Earth radius to get kilometers
Slaver answered 1/5, 2022 at 11:31 Comment(0)
C
0

Some of these answers are "rounding" the radius of the earth. If you check these against other distance calculators (such as geopy), these functions will be off.

You can switch out R=3959.87433 for the conversion constant below if you want the answer in miles.

If you want kilometers, use R= 6372.8.

lon1 = -103.548851
lat1 = 32.0004311
lon2 = -103.6041946
lat2 = 33.374939


def haversine(lat1, lon1, lat2, lon2):

      R = 3959.87433 # this is in miles.  For Earth radius in kilometers use 6372.8 km

      dLat = radians(lat2 - lat1)
      dLon = radians(lon2 - lon1)
      lat1 = radians(lat1)
      lat2 = radians(lat2)

      a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
      c = 2*asin(sqrt(a))

      return R * c

print(haversine(lat1, lon1, lat2, lon2))
Charpoy answered 30/7, 2017 at 3:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.