Fastest pairwise distance metric in python
I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.

Here's an example that gives me what I want with an array of 1000 numbers.

import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])

What's the fastest implementation in scipy/numpy/scikit-learn that I can use, given that it has to scale to situations where the 1D array has more than 10k values?

Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.
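For what it's worth, SciPy's `scipy.spatial.distance.pdist` already exploits the symmetry: it returns a condensed vector with each pair computed exactly once, and `scipy.spatial.distance.squareform` expands it to the full matrix when needed. A minimal sketch on a toy array (the tiny input is just for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

r = np.array([1.0, 4.0, 9.0])
# pdist needs a 2D array; 'cityblock' on a column vector is |x_i - x_j|,
# the same metric the broadcast version computes for 1D data.
condensed = pdist(r[:, None], 'cityblock')  # n*(n-1)//2 entries, each pair once
full = squareform(condensed)                # symmetric n x n matrix

print(condensed)                                      # [3. 8. 5.]
print(np.array_equal(full, np.abs(r - r[:, None])))   # True
```

Storing only the condensed form also halves the memory footprint, which matters once n passes 10k.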

Xanthous answered 29/11, 2013 at 3:40 Comment(13)
There's a function for that: scipy.spatial.distance.pdist. I don't know whether it's the fastest option, since it needs checks for multidimensional data, non-Euclidean norms, and other things, but it's built in. – Romie
How fast do you need this to be? It's never going to scale better than O(n^2), since you have to populate n^2 entries of output. Your existing solution is O(n^2), and there doesn't seem to be much room for major optimization. – Romie
This seems to scale to >10k values well enough already when I try it. Remember that you need to populate 100 million entries of output. That's almost half a gigabyte of pairwise distances. – Romie
Seconded to @Romie and @askewchan, but make sure your numpy is compiled against BLAS or MKL; the one you download directly from SourceForge likely is not. – Dover
@askewchan I don't think it does... If you follow the source code, this is ultimately the function that gets called. Not only is there no fancy optimization, but for 1D vectors it squares and takes the square root just to compute an absolute value. Probably worse than the OP's code for this particular use case. – Conrado
@CTZhu If I'm not mistaken, scipy is always compiled with BLAS; it's not optional as with numpy. – Impetus
@Conrado Ah, I stand corrected. Thanks for actually looking it up :) – Impetus
On the other hand, calling pdist with 'cityblock' as the metric should do the trick. – Conrado
Did you check this thread? #17527840 There I had a similar problem; when the arrays get big, you have to exploit memory layout to get speedups. – Customable
@askewchan: scipy.spatial.distance only calls BLAS in a limited number of cases, and only after doing some extra operations, because BLAS has no distance functions. It's not faster for the OP's use case. – Baccarat
You're correct @larsmans, as Jaime has also said. I have deleted the comment. – Impetus
Do you mind giving a link to the question you are referring to? Which post on SO? – Stanley
Sorry - this was so long ago that I don't remember which post it was. But I do know it was somewhere on SO! – Xanthous

Neither of the other answers quite answered the question: one was in Cython, the other was slower than the broadcasting approach. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.

Here's some code:

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]  # column vector: pdist and sklearn expect 2D input

def option1(r):
    # broadcasting: computes every distance twice
    return np.abs(r - r[:, None])

def option2(c):
    # condensed output: each pair computed once
    return scipy.spatial.distance.pdist(c, 'cityblock')

def option3(c):
    return sklearn.metrics.pairwise.manhattan_distances(c)

Timing with IPython:

In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop

In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop

In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop

I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (normalizing for the different machines by comparing timings of the np.abs solution on each).
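One caveat worth noting: option2's output is not a square matrix. pdist returns the condensed form, a vector of n*(n-1)//2 distances, which is part of why it wins. The condensed vector corresponds to the strict upper triangle of the full matrix read row by row; a quick sanity check on a toy array:

```python
import numpy as np
import scipy.spatial.distance

r = np.array([1.0, 4.0, 9.0, 2.0])
n = r.shape[0]
condensed = scipy.spatial.distance.pdist(r[:, None], 'cityblock')
full = np.abs(r - r[:, None])

# condensed holds the strict upper triangle, row-major
assert condensed.shape == (n * (n - 1) // 2,)
assert np.array_equal(condensed, full[np.triu_indices(n, 1)])
```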

Xanthous answered 2/12, 2013 at 0:59 Comment(3)
I assume this is as fast as the version in sklearn? scikit-learn.org/stable/modules/generated/… – Stanley
In the scipy.spatial.distance.pdist case, should it be c instead of r? – Fredra
@Fredra I think so. – Claus

Here is a Cython implementation that gives more than a 3x speed improvement for this example on my computer. The timing should be re-checked for bigger arrays, though, because BLAS routines can probably scale much better than this rather naive code.

I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:

File my_cython.pyx:

import numpy as np
cimport numpy as np
import cython

from libc.math cimport fabs  # C fabs; C's abs() is the integer variant

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
    cdef int i, j, c, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    size = r.shape[0] * (r.shape[0] + 1) // 2  # upper triangle incl. diagonal
    ans = np.empty(size, dtype=r.dtype)
    c = -1
    for i in range(r.shape[0]):
        for j in range(i, r.shape[0]):
            c += 1
            ans[c] = fabs(r[i] - r[j])
    return ans

The result is a 1D array containing each pairwise distance exactly once: the upper triangle including the zero diagonal, laid out row by row.

To import into Python:

import numpy as np
import random

import pyximport; pyximport.install()
from my_cython import pairwise_distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)

def solOP(r):
    return np.abs(r - r[:, None])

Timing with IPython:

In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop

In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
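For readers who can't compile Cython either, here is a plain-NumPy mirror of the same loop, included only to document the output layout; it is far slower than the Cython version. Note the layout differs from pdist's: it keeps the i == j zero entries, so it has n*(n+1)//2 elements rather than n*(n-1)//2.

```python
import numpy as np

def pairwise_distance_py(r):
    # Pure-Python rendering of the Cython routine above, for reference:
    # row-major upper triangle *including* the zero diagonal.
    n = r.shape[0]
    out = np.empty(n * (n + 1) // 2, dtype=r.dtype)
    c = 0
    for i in range(n):
        for j in range(i, n):
            out[c] = abs(r[i] - r[j])
            c += 1
    return out

print(pairwise_distance_py(np.array([1.0, 4.0, 9.0])))  # [0. 3. 8. 0. 5. 0.]
```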
Syllabogram answered 29/11, 2013 at 11:3 Comment(1)
Surely you meant fabs - C's abs is the integer variant. – Baccarat

Using half the memory, but 6 times slower than np.abs(r - r[:, None]):

triu = np.triu_indices(r.shape[0], 1)
dists2 = np.abs(r[triu[1]] - r[triu[0]])
Sunglass answered 30/11, 2013 at 23:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.