Python calculate lots of distances quickly

I have an input of 36,742 points, which means that if I wanted to calculate the lower triangle of a distance matrix (using the vincenty approximation) I would need to generate 36,742*36,741*0.5 = 674,968,911 distances.

I want to keep the pair combinations which are within 50km of each other. My current set-up is as follows:

import time
from geopy.distance import vincenty

shops = [[id, lat, lon], ...]  # one [id, lat, lon] entry per shop

def lower_triangle_mat(points):
    # Yield every unique pair of points (i < j) exactly once
    for i in range(len(points)-1):
        for j in range(i+1, len(points)):
            yield [points[i], points[j]]

def return_stores_cutoff(points, cutoff_km=0):
    below_cut = []
    counter = 0
    n_pairs = len(points)*(len(points)-1)*0.5
    for x in lower_triangle_mat(points):
        dist_km = vincenty(x[0][1:3], x[1][1:3]).km
        counter += 1
        if counter % 1000000 == 0:
            print("%d out of %d" % (counter, n_pairs))
        if dist_km <= cutoff_km:
            below_cut.append([x[0][0], x[1][0], dist_km])
    return below_cut

start = time.clock()
stores = return_stores_cutoff(points=shops, cutoff_km=50)
print(time.clock() - start)

This will obviously take hours and hours. Some possibilities I was thinking of:

  • Use numpy to vectorise these calculations rather than looping through
  • Use some kind of hashing to get a quick rough-cut off (all stores within 100km) and then only calculate accurate distances between those stores
  • Instead of storing the points in a list use something like a quad-tree but I think that only helps with the ranking of close points rather than actual distance -> so I guess some kind of geodatabase
  • I can obviously try the haversine or project and use euclidean distances, however I am interested in using the most accurate measure possible
  • Make use of parallel processing (however, I was having a bit of difficulty coming up with how to split the list so that every relevant pair is still covered exactly once - see the sketch after this list).
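
For that last bullet, a minimal sketch of one way the work could be split without losing or duplicating pairs (purely illustrative - it assumes the shops list and vincenty import from above, plus a fork-based multiprocessing start method so workers can see the module-level shops): chunking on the outer index i means each worker produces only the pairs (i, j) with j > i.

from multiprocessing import Pool
from geopy.distance import vincenty

def pairs_within_cutoff(i, cutoff_km=50):
    # All pairs (i, j) with j > i, so no pair is produced twice across workers
    out = []
    for j in range(i + 1, len(shops)):
        dist_km = vincenty(shops[i][1:3], shops[j][1:3]).km
        if dist_km <= cutoff_km:
            out.append([shops[i][0], shops[j][0], dist_km])
    return out

if __name__ == '__main__':
    with Pool() as pool:
        chunks = pool.map(pairs_within_cutoff, range(len(shops) - 1))
    below_cut = [pair for chunk in chunks for pair in chunk]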

Edit: I think geohashing is definitely needed here - an example using the geoindex package:

import random
from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
for _ in range(10000):
    lat = random.random()*180 - 90
    lng = random.random()*360 - 180
    geo_index.add_point(GeoPoint(lat, lng))

center_point = GeoPoint(37.7772448, -122.3955118)
for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))

However, I would also like to vectorise (instead of loop) the distance calculations for the stores returned by the geo-hash.

Edit2: Following Pouria Hadjibagheri's answer, I tried using lambda and map:

# [B]: Mapping approach
lwr_tr_mat = ((shops[i],shops[j]) for i in range(len(shops)-1) for j in range(i+1,len(shops)))

func = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km)
# Trying to see if conditional statements slow this down
func_cond = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km) if vincenty(x[0][1:3], x[1][1:3]).km <= 50 else None

start = time.clock()
out_dist = list(map(func, lwr_tr_mat))
print(time.clock() - start)

# The generator is exhausted by the first map, so rebuild it before timing again
lwr_tr_mat = ((shops[i],shops[j]) for i in range(len(shops)-1) for j in range(i+1,len(shops)))

start = time.clock()
out_dist = list(map(func_cond, lwr_tr_mat))
print(time.clock() - start)

And they were all around 61 seconds (I restricted the number of stores to 2,000 from 32,000). Perhaps I used map incorrectly?

Williamswilliamsburg answered 9/2, 2016 at 16:19 Comment(4)
Those all sound like good ideas.... what's the question?Piles
With so many points it's difficult for me to decide on the best course of action and was hoping for some pointers on what to try and what not to waste my time on.Williamswilliamsburg
@Piles Oh come on! The question is crystal clear.Stolon
Unless you have a beefy machine, you might need to do it in chunks. Storing those ~675 million distances will take upwards of 5 Gbytes of memory.Flivver

This sounds like a classic use case for k-D trees.

If you first transform your points into Euclidean space then you can use the query_pairs method of scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

tree = cKDTree(data)
# where data is (nshops, ndim) containing the Euclidean coordinates of each shop
# in units of km

pairs = tree.query_pairs(50, p=2)   # 50km radius, L2 (Euclidean) norm

pairs will be a set of (i, j) tuples corresponding to the row indices of pairs of shops that are ≤50km from each other.
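
For example, the Euclidean coordinates could come from a simple spherical-Earth conversion (a sketch, not part of the original setup; it assumes the shop co-ordinates sit in an (n, 3) id|lat|lon numpy array called points, and the spherical approximation is more than adequate for a 50km neighbour search). The pair distances then fall out of a single vectorised step:

import numpy as np
from scipy.spatial import cKDTree

R = 6371.0  # mean Earth radius in km
lat = np.radians(points[:, 1])   # points assumed to be the (n, 3) id|lat|lon array
lon = np.radians(points[:, 2])
# 3-D Cartesian coordinates on a sphere, in km; at a 50km scale the chord length
# is indistinguishable from the great-circle distance
data = np.column_stack((R * np.cos(lat) * np.cos(lon),
                        R * np.cos(lat) * np.sin(lon),
                        R * np.sin(lat)))

tree = cKDTree(data)
pairs = np.array(sorted(tree.query_pairs(50, p=2)))   # (npairs, 2) array of row indices
dists_km = np.linalg.norm(data[pairs[:, 0]] - data[pairs[:, 1]], axis=1)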


If you also want the distance values rather than just the pairs, you can use tree.sparse_distance_matrix. Its output is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you're only interested in unique row/column pairs, you could use scipy.sparse.tril to zero out the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the nonzero row and column indices and their corresponding distance values via the .row, .col and .data attributes:

from scipy import sparse

tree_dist = tree.sparse_distance_matrix(tree, max_distance=10000, p=2)
udist = sparse.tril(tree_dist, k=-1)    # keep the strictly lower triangle (k=-1 also drops the diagonal)
ridx = udist.row    # row indices
cidx = udist.col    # column indices
dist = udist.data   # distance values
Casanova answered 9/2, 2016 at 18:22 Comment(7)
Ah thank you! This seems like a similar approach to using geo-hashing - would you happen to know the advantages/disadvantages compared to it?Williamswilliamsburg
TBH I've never come across geohashing before (I'm a neuroscientist, not a geographer!). At a glance, the geoindex module you linked to seems to be implemented in pure Python and requires looping over array elements in Python, whereas cKDTree is written in C and operates on numpy arrays directly. Algorithmic efficiency aside, I might expect cKDTree to be faster by virtue of having much less Python overhead, but I don't really know enough about the geohashing algorithm to give you a proper answer. I suggest you benchmark them and find out!Casanova
Ali, do you have any suggestions on how to save a cKDTree sparse distance matrix as unique pairwise combinations rather than as a matrix?Williamswilliamsburg
Thank you again! I also used np.column_stack to help export the whole thing into a CSV: "w.writerows(np.column_stack((points[udist.row ],points[udist.col], udist.data)))"Williamswilliamsburg
CSV isn't a great choice of output format. Text-based formats require more storage space and are typically much slower and more memory-intensive to read and write than binary formats (e.g. numpy's native .npy format). In the case of float data they also guarantee some loss of precision, since it's impossible to exactly represent an arbitrary float using a finite-length decimal string. Text-based formats make most sense in cases where the data needs to be human-readable, but I'm guessing that most humans wouldn't ever want to read over a million rows of float values...Casanova
Ali, I just noticed a weird problem; there are some observations which are within 500metres (e.g.) however if I impose a cut-off of 1600 metres they get dropped. I have tried running the whole thing just on them (e.g. 5 points) and that works so I am wondering whether there is something wrong in the code that drops them? If it helps my latest 'answer' contains my full code. I'm not too sure why this is happening - it's not like 500 metres is closer to the borderline-case (1600)Williamswilliamsburg
Just to add it appears that the ordering of my input CSV affects the final number of points I get within 1600metres? So I thought perhaps this line is affecting it? "udist = sparse.tril(tree_dist, k=-1) "Williamswilliamsburg

Have you tried mapping entire arrays and functions instead of iterating through them? An example would be as follows:

from numpy.random import rand

my_array = rand(int(5e7), 1)  # An array of 50,000,000 random numbers in double.

Now what is normally done is:

squared_list_iter = [value**2 for value in my_array]

Which of course works, but is far from optimal.

The alternative would be to map the array with a function. This is done as follows:

func = lambda x: x**2  # Here is what I want to do to my array.

squared_list_map = map(func, my_array)  # Here I am doing it!

Now, one might ask, how is this any different, or even better for that matter? Since now we have added a call to a function, too! Here is your answer:

For the former solution (via iteration):

1 loop: 1.11 minutes.

Compared to the latter solution (mapping):

500 loops, 560 ns on average.

Converting the map() result to a list with list(map(func, my_array)) would increase the time to approximately 500 ms.

You choose!

Stolon answered 9/2, 2016 at 16:57 Comment(12)
Thanks for this! I tried to implement it in my code (see Edit2 to my original post), however I did not really find a speed-up. For 2000 stores the original code took 61.3 seconds and this one took 61.1 seconds too.Williamswilliamsburg
Did you time it like this: %timeit out_dist = list(map(func,lwr_tr_mat)) or did you time the entire process? Because you still have iterations in the first line.Stolon
I used: "start = time.clock() out_dist = list(map(func_cond,lwr_tr_mat)) print(time.clock() - start)"Williamswilliamsburg
aaah. That's the reason. The time is spent on that iteration, not on mapping. Reshape your matrix to n by 1, or apply "every" iteration as a function mapped to the matrix/array.Stolon
Also, install jupyter or ipython and the function %timeit to test individual lines. It's more accurate, and more informative.Stolon
The timeit module was giving me a few issues "... longer than the fastest. This could mean that an intermediate result is being cached ..." so I just used time() instead. I'm not sure what you mean by apply every iteration, however? Do you mean instead of using a generator load everything into RAM?Williamswilliamsburg
%timeit runs a line through a loop (chooses by default the best number of iterations from 1 to 10k) and takes an average. This is more accurate because (1) it thereby excluded the time taken to call the function, and (2) it excludes time variations that may occur due to other small operations that may be running on the system.Stolon
Yup. Don't use iterations, or you're back to square one. Either vectorisation, or mapping. Forget iterations exist. If you find yourself unable to do that, write C / Cython codes to do that part.Stolon
Hmm, I think generating a list of N*(N-1)*0.5 pairs for N = 35,000 would take hours by itselfWilliamswilliamsburg
Not if you do it by mapping or using numpy.reshape; then you can use the var.tolist() function to get a list(), provided var is a numpy.array object.Stolon
Pouria, could you list a few numpy functions I can google to get this? Starting from a base of "np.genfromtxt(path_to_csv, dtype=float, delimiter=',', names=True)", how would I cross them to get a long list of the lower triangle?Williamswilliamsburg
You'll find reshape [ docs.scipy.org/doc/numpy-1.10.1/reference/generated/… ] and roll [ docs.scipy.org/doc/numpy-1.10.1/reference/generated/… ] helpful.Stolon

Thanks for everyone's help. I think I have solved this by incorporating all the suggestions.

I use numpy to import the geographic co-ordinates and then project them using "France Lambert - 93". This lets me fill a scipy.spatial.cKDTree with the points and then calculate a sparse_distance_matrix by specifying a cut-off of 50km (my projected points are in metres). I then extract the lower-triangle to a CSV.

import numpy as np
import csv
import time
from pyproj import Proj, transform

#http://epsg.io/2154 (accuracy: 1.0m)
fr = '+proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 \
+x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 \
+units=m +no_defs'

#http://epsg.io/27700-5339 (accuracy: 1.0m)
uk = '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 \
+x_0=400000 +y_0=-100000 +ellps=airy \
+towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'

path_to_csv = '.../raw_in.csv'
out_csv = '.../out.csv'

def proj_arr(points):
    inproj = Proj(init='epsg:4326')
    outproj = Proj(fr)   # Lambert-93 as described above (use uk for British data)
    # input columns: ID|lat|lon -> transform expects lon, lat
    func = lambda x: transform(inproj,outproj,x[2],x[1])
    return np.array(list(map(func, points)))

tstart = time.time()

# Import points as geographic coordinates
# ID|lat|lon
#Sample to try and replicate
#points = np.array([
#        [39007,46.585012,5.5857829],
#        [88086,48.192370,6.7296289],
#        [62627,50.309155,3.0218611],
#        [14020,49.133972,-0.15851507],
#        [1091, 42.981765,2.0104902]])
#
points = np.genfromtxt(path_to_csv,
                       delimiter=',',
                       skip_header=1)

print("Total points: %d" % len(points))
print("Triangular matrix contains: %d" % (len(points)*((len(points))-1)*0.5))
# Get projected co-ordinates
proj_pnts = proj_arr(points)

# Fill k-d tree
from scipy.spatial import cKDTree
tree = cKDTree(proj_pnts)
cut_off_metres = 1600
tree_dist = tree.sparse_distance_matrix(tree,
                                        max_distance=cut_off_metres,
                                        p=2) 

# Extract triangle
from scipy import sparse
udist = sparse.tril(tree_dist, k=-1)    # keep the strictly lower triangle (k=-1 also drops the diagonal)
print("Distances after k-d tree cut-off: %d " % len(udist.data))

# Export CSV
f = open(out_csv, 'w', newline='') 
w = csv.writer(f, delimiter=",", )
w.writerow(['id_a','lat_a','lon_a','id_b','lat_b','lon_b','metres'])
w.writerows(np.column_stack((points[udist.row ],
                             points[udist.col],
                             udist.data)))
f.close()

"""
Get ID labels
"""
id_to_csv = '...id.csv'
id_labels = np.genfromtxt(id_to_csv,
                       delimiter=',',
                       skip_header=1,
                       dtype='U')

"""
Try vincenty on the un-projected co-ordinates
"""
from geopy.distance import vincenty
vout_csv = '.../out_vin.csv'
test_vin = np.column_stack((points[udist.row].T[1:3].T,
                            points[udist.col].T[1:3].T))

func = lambda x: vincenty(x[0:2],x[2:4]).m
output = list(map(func,test_vin))

# Export CSV
f = open(vout_csv, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['id_a','id_a2', 'lat_a','lon_a',
            'id_b','id_b2', 'lat_b','lon_b',
            'proj_metres','vincenty_metres'])
w.writerows(np.column_stack((list(id_labels[udist.row]),
                             points[udist.row ],
                             list(id_labels[udist.col]),
                             points[udist.col],
                             udist.data,
                             output,
                             )))

f.close()    
print("Finished in %.0f seconds" % (time.time()-tstart)

This approach took 164 seconds to generate (for 5,306,434 distances) - compared to 9 - and also around 90 seconds to save to disk.

I then compared the vincenty distance against the hypotenuse distance on the projected co-ordinates.

The mean difference was 2.7 metres and the mean relative difference was 0.0073% - which looks great.
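
For reference, that comparison can be computed directly from the arrays already produced by the script above (a small sketch, reusing udist and output from the code):

import numpy as np

proj_m = udist.data            # hypotenuse distance on the projected co-ordinates
vin_m = np.array(output)       # vincenty distance on the raw lat/lon
diff = np.abs(vin_m - proj_m)
print("Mean difference: %.1f metres" % diff.mean())
print("Mean relative difference: %.4f%%" % (100 * (diff / vin_m).mean()))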

Williamswilliamsburg answered 10/2, 2016 at 18:47 Comment(0)

"Use some kind of hashing to get a quick rough-cut off (all stores within 100km) and then only calculate accurate distances between those stores" I think this might be better called gridding. So first make a dict, with a set of coords as the key and put each shop in a 50km bucket near that point. then when you are calculating distances, you only look in nearby buckets, rather than iterate through each shop in the whole universe

Yila answered 9/2, 2016 at 16:59 Comment(0)

You can use vectorization with the haversine formula discussed in this thread: Haversine Formula in Python (Bearing and Distance between two GPS points).

import numpy as np

# lon1, lat1, lon2, lat2 are numpy arrays of co-ordinates in degrees
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6371 * c

Here is the %%timeit result for 7,451,653 distances:

642 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
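
To cover the all-pairs case in the question, the input arrays could be built from triangle index pairs (a sketch; it assumes the co-ordinates sit in an (n, 3) id|lat|lon numpy array called points, and note that for the full 36,742 shops the index arrays alone hold roughly 675 million entries each, so you would want to chunk over i):

import numpy as np

lats, lons = points[:, 1], points[:, 2]
i, j = np.triu_indices(len(points), k=1)   # every unique pair exactly once (i < j)
lat1, lon1 = lats[i], lons[i]
lat2, lon2 = lats[j], lons[j]
# ...then apply the haversine lines above to these arrays and keep km <= 50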
Lashundalasker answered 30/6, 2022 at 14:45 Comment(0)
