More efficient way to calculate distance in NumPy?
I have a question about how to calculate distances in NumPy as fast as possible:

import time
import numpy

def getR1(VVm,VVs,HHm,HHs):
    t0=time.time()
    R=VVs.flatten()[numpy.newaxis,:]-VVm.flatten()[:,numpy.newaxis]
    R*=R
    R1=HHs.flatten()[numpy.newaxis,:]-HHm.flatten()[:,numpy.newaxis]
    R1*=R1
    R+=R1
    del R1
    print "R1\t",time.time()-t0, R.shape, #11.7576191425 (108225, 10500) 
    print numpy.max(R) #4176.26290975
    # uses 17.5Gb ram
    return R


def getR2(VVm,VVs,HHm,HHs):
    t0=time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    deltas = precomputed_flat[None,:,:] - measured_flat[:, None, :]
    #print time.time()-t0, deltas.shape # 5.861109972 (108225, 10500, 2)
    R = numpy.einsum('ijk,ijk->ij', deltas, deltas)
    print "R2\t",time.time()-t0,R.shape, #14.5291359425 (108225, 10500)
    print numpy.max(R) #4176.26290975
    # uses 26Gb ram
    return R


def getR3(VVm,VVs,HHm,HHs):
    from numpy.core.umath_tests import inner1d
    t0=time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    deltas = precomputed_flat[None,:,:] - measured_flat[:, None, :]
    #print time.time()-t0, deltas.shape # 5.861109972 (108225, 10500, 2)
    R = inner1d(deltas, deltas)
    print "R3\t",time.time()-t0, R.shape, #12.6972110271 (108225, 10500)
    print numpy.max(R) #4176.26290975
    #Uses 26Gb
    return R


def getR4(VVm,VVs,HHm,HHs):
    import scipy.spatial.distance as spdist
    t0=time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    R=spdist.cdist(precomputed_flat,measured_flat, 'sqeuclidean') #.T
    print "R4\t",time.time()-t0, R.shape, #17.7022118568 (108225, 10500)
    print numpy.max(R) #4176.26290975
    # uses 9 Gb ram
    return R

def getR5(VVm,VVs,HHm,HHs):
    import scipy.spatial.distance as spdist
    t0=time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    R=spdist.cdist(precomputed_flat,measured_flat, 'euclidean') #.T
    print "R5\t",time.time()-t0, R.shape, #15.6070930958 (108225, 10500)
    print numpy.max(R) #64.6240118667
    # uses only 9 Gb ram
    return R

def getR6(VVm,VVs,HHm,HHs):
    from scipy.weave import blitz
    t0=time.time()
    R=VVs.flatten()[numpy.newaxis,:]-VVm.flatten()[:,numpy.newaxis]
    blitz("R=R*R") # R*=R
    R1=HHs.flatten()[numpy.newaxis,:]-HHm.flatten()[:,numpy.newaxis]
    blitz("R1=R1*R1") # R1*=R1
    blitz("R=R+R1") # R+=R1
    del R1
    print "R6\t",time.time()-t0, R.shape, #11.7576191425 (108225, 10500) 
    print numpy.max(R) #4176.26290975
    return R

results in the following times:

R1  11.7737319469 (108225, 10500) 4909.66881791
R2  15.1279799938 (108225, 10500) 4909.66881791
R3  12.7408981323 (108225, 10500) 4909.66881791
R4  17.3336868286 (10500, 108225) 4909.66881791
R5  15.7530870438 (10500, 108225) 70.0690289494
R6  11.670968771 (108225, 10500) 4909.66881791

The last one gives sqrt((VVm-VVs)^2+(HHm-HHs)^2), while the others give (VVm-VVs)^2+(HHm-HHs)^2. This is not really important, since further on in my code I take the minimum of R[i,:] for each i, and sqrt doesn't influence where that minimum is. (If I am interested in the actual distance, I just take sqrt(value) of that single entry instead of applying sqrt over the entire array, so there is really no timing difference due to that.)
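The claim that skipping the sqrt is safe can be checked on a small random example: sqrt is monotonically increasing, so the argmin of the squared distances and of the true distances always coincide.

```python
import numpy

# sqrt is monotonically increasing, so it preserves the location of the
# minimum: argmin over squared distances == argmin over true distances.
rng = numpy.random.RandomState(0)
R_sq = rng.rand(5, 100) * 100   # pretend these are squared distances
R = numpy.sqrt(R_sq)            # the corresponding true distances

for i in range(R_sq.shape[0]):
    assert R_sq[i, :].argmin() == R[i, :].argmin()
```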

The question remains: why is the first solution the fastest? (The second and third are slower because deltas=... alone takes 5.8 seconds, which is also why those two methods need 26 GB.) And why is sqeuclidean slower than euclidean?

sqeuclidean should just compute (VVm-VVs)^2+(HHm-HHs)^2, but I suspect it does something different. Does anyone know how to find the source code (C or whatever sits at the bottom) of that method? I suspect it does sqrt((VVm-VVs)^2+(HHm-HHs)^2)^2, the only reason I can think of why it would be slower than (VVm-VVs)^2+(HHm-HHs)^2. (I know it's a silly reason; does anyone have a more logical one?)
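Whatever the per-metric C implementation does internally, the two metrics can at least be checked for mutual consistency on a small array. This sketch assumes SciPy is installed; the C sources for the distance metrics live in SciPy's own repository under scipy/spatial/src (the exact file names vary by version).

```python
import numpy
from scipy.spatial.distance import cdist

rng = numpy.random.RandomState(0)
a = rng.rand(50, 2)
b = rng.rand(30, 2)

d_sq = cdist(a, b, 'sqeuclidean')   # (x-x')^2 + (y-y')^2
d_eu = cdist(a, b, 'euclidean')     # sqrt of the above

# whichever way each metric is implemented, sqeuclidean should equal
# the euclidean result squared
assert numpy.allclose(d_sq, d_eu ** 2)
```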

Since I know nothing of C: how would I inline this with scipy.weave? And is that code compiled the normal way you run Python, or do I need to have something special installed for it?

Edit: OK, I tried it with scipy.weave.blitz (the R6 method), and that is slightly faster, but I assume someone who knows more C than me can still improve this speed. I just took the lines of the form a+=b or a*=b, looked up how they would be written in C, and put them in the blitz statement, but I guess that if I also moved the lines with flatten and newaxis into C, it would go faster still. I don't know how to do that, though (someone who knows C, can you explain?). As it stands, the difference between the blitz version and my first method is not big enough to really be a C-vs-NumPy effect, I guess?

I guess the other methods, like the one with deltas=..., could also go much faster if I moved them into C?
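One way to cut the memory traffic without writing any C is to process the measured points in blocks, so the full (108225, 10500, 2) deltas array never exists at once. A sketch (the chunk size is a tunable assumption, and `getR_chunked` is a hypothetical name, not one of the functions above):

```python
import numpy

def getR_chunked(VVm, VVs, HHm, HHs, chunk=1024):
    """Squared distances computed block by block.

    Avoids materialising any (n_m, n_s, 2) deltas array; the peak
    temporary allocation is only (chunk, n_s) per block.
    """
    vm, hm = VVm.ravel(), HHm.ravel()
    vs, hs = VVs.ravel(), HHs.ravel()
    R = numpy.empty((vm.size, vs.size))
    for start in range(0, vm.size, chunk):
        stop = start + chunk
        d = vs[None, :] - vm[start:stop, None]
        d *= d
        d1 = hs[None, :] - hm[start:stop, None]
        d1 *= d1
        d += d1
        R[start:stop] = d
    return R

# tiny correctness check against the direct broadcast of getR1
rng = numpy.random.RandomState(1)
VVm, HHm = rng.rand(7, 3), rng.rand(7, 3)
VVs, HHs = rng.rand(5, 2), rng.rand(5, 2)
direct = ((VVs.ravel()[None, :] - VVm.ravel()[:, None]) ** 2
          + (HHs.ravel()[None, :] - HHm.ravel()[:, None]) ** 2)
assert numpy.allclose(getR_chunked(VVm, VVs, HHm, HHs, chunk=4), direct)
```

Chunking does not reduce the arithmetic, but it keeps each temporary small enough to stay cache- and RAM-friendly, which is where most of the time above seems to go.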

Pipsissewa answered 8/7, 2013 at 13:1 Comment(8)
consider trying something along the lines of jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2 (esp 'numpy with broadcasting' part)Ene
You could probably shave a few seconds off by not allocating memory for R (i.e., just use R1 += R3).Quilt
@Quilt yeah, same way as R1*=R1, but still, that won't reduce it down to ~1 sec or so (which I assume should happen when it is fully in C from NumPy)?Pipsissewa
Not down to 1 sec but if you're using 32-bit floats, that would save you having to allocate about 4 GB of RAM, which is significant. And if it lets you avoid using swap, then it will be a significant improvement. I'll be surprised it can get down to 1 sec in C (even with no python), considering how much memory it requires (unless you have a lot of RAM and are significantly multithreaded)Quilt
I've got about 50 GB of RAM, so it's not swapping yetPipsissewa
Lucky you! I suspect you'd still make a nontrival improvement by avoiding the unnecessary allocation. If you haven't already done so, you may want to verify that you are running multithreaded versions of the lower level libraries (BLAS, ATLAS).Quilt
Well, I was thinking more of something like numpy.linalg.norm, which does the squaring etc. in internal C, but the problem is that it takes the norm of the entire matrix, so it results in one number instead of what I want, which is a matrix.Pipsissewa
After some editing, the question is still the same, but far more structured and with more timingsPipsissewa
Whenever you have multiplications and sums, try to use one of the dot product functions or np.einsum. Since you are preallocating your arrays anyway, rather than keeping horizontal and vertical coordinates in separate arrays, stack them together:

precomputed_flat = np.column_stack((svf.flatten(), shf.flatten()))
measured_flat = np.column_stack((VVmeasured.flatten(), HHmeasured.flatten()))
deltas = precomputed_flat - measured_flat[:, None, :]

From here, the simplest would be:

dist = np.einsum('ijk,ijk->ij', deltas, deltas)

You could also try something like:

from numpy.core.umath_tests import inner1d
dist = inner1d(deltas, deltas)

There is of course also SciPy's spatial module cdist:

from scipy.spatial.distance import cdist
dist = cdist(precomputed_flat, measured_flat, 'euclidean')
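On a tiny input, the einsum and cdist expressions can be checked against each other before scaling up (a sketch assuming SciPy is available; inner1d is left out because numpy.core.umath_tests was later deprecated):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(2)
precomputed_flat = rng.rand(40, 2)   # stand-in for the stacked (n_s, 2) array
measured_flat = rng.rand(25, 2)      # stand-in for the stacked (n_m, 2) array

deltas = precomputed_flat[None, :, :] - measured_flat[:, None, :]
d_einsum = np.einsum('ijk,ijk->ij', deltas, deltas)          # (25, 40)

# cdist with 'sqeuclidean' should produce the same squared distances
d_cdist = cdist(measured_flat, precomputed_flat, 'sqeuclidean')
assert np.allclose(d_einsum, d_cdist)
```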

EDIT I cannot run tests on such a large dataset, but these timings are rather enlightening:

len_a, len_b = 10000, 1000

a = np.random.rand(2, len_a)
b =  np.random.rand(2, len_b)
c = np.random.rand(len_a, 2)
d = np.random.rand(len_b, 2)

In [3]: %timeit a[:, None, :] - b[..., None]
10 loops, best of 3: 76.7 ms per loop

In [4]: %timeit c[:, None, :] - d
1 loops, best of 3: 221 ms per loop

For the above smaller dataset, I can get a slight speed up over your method with scipy.spatial.distance.cdist and match it with inner1d, by arranging data differently in memory:

precomputed_flat = np.vstack((svf.flatten(), shf.flatten()))
measured_flat = np.vstack((VVmeasured.flatten(), HHmeasured.flatten()))
deltas = precomputed_flat[:, None, :] - measured_flat

import scipy.spatial.distance as spdist
from numpy.core.umath_tests import inner1d

In [13]: %timeit r0 = a[0, None, :] - b[0, :, None]; r1 = a[1, None, :] - b[1, :, None]; r0 *= r0; r1 *= r1; r0 += r1
10 loops, best of 3: 146 ms per loop

In [14]: %timeit deltas = (a[:, None, :] - b[..., None]).T; inner1d(deltas, deltas)
10 loops, best of 3: 145 ms per loop

In [15]: %timeit spdist.cdist(a.T, b.T)
10 loops, best of 3: 124 ms per loop

In [16]: %timeit deltas = a[:, None, :] - b[..., None]; np.einsum('ijk,ijk->jk', deltas, deltas)
10 loops, best of 3: 163 ms per loop
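Since memory traffic dominates here, it is also worth noting that pairwise squared distances can be rewritten as ||a||^2 + ||b||^2 - 2*a.b^T, which turns the bulk of the work into a single BLAS matrix multiply and never builds a deltas array at all. A sketch (for 2-D points the gain is modest, and `sqdist_dot` is an illustrative name, not an existing routine):

```python
import numpy as np

def sqdist_dot(a, b):
    """Pairwise squared euclidean distances via the expansion
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed with one
    BLAS matrix multiply instead of an (n, m, k) deltas array."""
    aa = (a * a).sum(axis=1)[:, None]   # (n, 1)
    bb = (b * b).sum(axis=1)[None, :]   # (1, m)
    d = aa + bb - 2.0 * a.dot(b.T)      # (n, m)
    # rounding can leave tiny negatives for near-duplicate points
    np.maximum(d, 0, out=d)
    return d

# check against the explicit deltas-based computation
rng = np.random.RandomState(3)
a = rng.rand(100, 2)
b = rng.rand(80, 2)
deltas = a[:, None, :] - b[None, :, :]
assert np.allclose(sqdist_dot(a, b), (deltas ** 2).sum(axis=-1))
```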
Alimentary answered 8/7, 2013 at 14:5 Comment(13)
alternatively to np.einsum one can use np.tensordot(), which has also a very flexible notation...Badenpowell
Sadly, all 3 methods you suggest are slower (the deltas=... alone takes six seconds, which is why they are slower)Pipsissewa
Funny how memory management ruins the best laid plans... I don't fully understand what's going on, but see my edit. You may want to try the above methods on your huge arrays to see if the timings behave differently, but there may be some margin to win with scipy.Alimentary
Your fastest way is still slower than mine, but I think that's because in my case I just do (x-x')**2+(y-y')**2. (Later on I take the minimal value from it, and taking the sqrt won't change where the minimum is, so I don't do that in my calculation, while cdist does, I assume.) The [15] one times out at 15.7 sec, while mine takes 11.8 sec; I assume that's due to the sqrt taken in the cdist routine, so I really think that if there were a similar routine without the sqrt in it, it would end up much fasterPipsissewa
According to the docs, spdist.cdist(a.T, b.T, 'sqeuclidean') should do just that, can't test it right now. Anyway, interesting to see how memory handling becomes everything when you are using a lot of it!Alimentary
the one where you just use R=spdist.cdist(precomputed_flat.T, measured_flat.T).T times out at 15.8 sec, while the one with sqeuclidean takes 17.2 sec, which makes me think that in the implementation of sqeuclidean they take sqrt(a^2+b^2)**2 instead of just a^2+b^2, because otherwise why would doing one calculation less actually take more time?Pipsissewa
Where did you find from numpy.core.umath_tests import inner1d? I can't even find a umath_tests file or directory (I do find numpy.core.umath_tests.so though). I wanted to look at the description of that inner1d method; is it somehow inside that .so file? (I don't really know what's in .so files, only that I can't open them somehow.)Pipsissewa
inner1d will be written in C, and is probably just broadcasting plus calls to the dot product functions of the BLAS library.Alimentary
So that explains why R2 is slower than R3, but why is R4 slower than R5? (And why is R1 the fastest of them all?)Pipsissewa
I'm trying to wrap my head around the axes argument of numpy.tensordot, just to see if it's faster than numpy.einsum, but I can't seem to figure out how it worksPipsissewa
I don't think you can get numpy.tensordot to do what you are after. You would have to do something like numpy.tensordot(deltas, deltas, ((2,), (2,))), but the return would be a rank 4 tensor, and your answer would be buried into some diagonal plane of it. It would blow up memory use to infinity and beyond!Alimentary
yeah, I already got to the point of having a massive 4D tensor and figured I was doing something wrong (and got memory issues with that method too, but I thought that was because I was doing something wrong). But if you say it's not possible to get a straight answer out of tensordot, then I guess that method won't be the faster answer either.Pipsissewa
Since sqeucl is slower than eucl, I guess it would go faster with scipy.weave? But I know nothing of C (except that it's fast), and in weave I would need to put the C code of the line that causes trouble in a string, but I'm not sure how that works (both the programming and the technical explanation)Pipsissewa
