Efficiently Calculating a Euclidean Distance Matrix Using Numpy

I have a set of points in 2-dimensional space and need to calculate the distance from each point to each other point.

I have a relatively small number of points, maybe at most 100. But since I need to do it often and rapidly in order to determine the relationships between these moving points, and since I'm aware that iterating through the points could be as bad as O(n^2) complexity, I'm looking for ways to take advantage of numpy's matrix magic (or scipy).

As it stands in my code, the coordinates of each object are stored in its class. However, I could also update them in a numpy array when I update the class coordinate.

class Cell(object):
    """Represents one object in the field."""
    def __init__(self,id,x=0,y=0):
        self.m_id = id
        self.m_x = x
        self.m_y = y

It occurs to me to create a Euclidean distance matrix to prevent duplication, but perhaps you have a cleverer data structure.

I'm open to pointers to nifty algorithms as well.

Also, I note that there are similar questions dealing with Euclidean distance and numpy but didn't find any that directly address this question of efficiently populating a full distance matrix.

Vannesavanness answered 28/3, 2014 at 18:47 Comment(3)
Here, this might help: scipy.spatial.distance.pdist - Crossness
Complexity is going to be O(n^2) no matter what: the best you can do for a general set of points is to only compute n * (n - 1) / 2 distances, which is still O(n^2). - Zebrawood
If scipy can be used, consider scipy.spatial.distance_matrix - Kerekes
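
For readers who want to try the pdist suggestion from the comments, here is a minimal sketch. The sample points are made up for illustration, and squareform is only needed if you want the full symmetric matrix rather than the condensed vector of n * (n - 1) / 2 distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical sample data: three 2-D points, one row per point.
points = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 4.0]])

# pdist returns only the unique pairwise distances, in the order
# (0,1), (0,2), (1,2), as a flat "condensed" vector.
condensed = pdist(points)

# squareform expands the condensed vector into the full n x n matrix.
full = squareform(condensed)
```

The condensed form avoids storing each distance twice, which is the "prevent duplication" idea from the question.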

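You can take advantage of the complex type: a point (x, y) becomes the complex number x + iy, and the absolute value of the difference of two such numbers is the distance between the two points.

import numpy as np

# build a complex array of your cells (x is the real part, y the imaginary part)
z = np.array([complex(c.m_x, c.m_y) for c in cells])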

First solution

# mesh this array so that you will have all combinations
m, n = np.meshgrid(z, z)
# get the distance via the norm
out = abs(m-n)

Second solution

Meshing is the main idea, but NumPy broadcasting does it for you, so you don't have to generate m and n. Just subtract z from a column-shaped view of z:

out = abs(z[..., np.newaxis] - z)

Third solution

And if z is created directly as a 2-dimensional array with a single row, you can use z.T instead of the weird z[..., np.newaxis]. So finally, your code will look like this:

z = np.array([[complex(c.m_x, c.m_y) for c in cells]]) # notice the [[ ... ]]
out = abs(z.T-z)

Example

>>> z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])
>>> abs(z.T-z)
array([[ 0.        ,  2.23606798,  4.12310563],
       [ 2.23606798,  0.        ,  4.24264069],
       [ 4.12310563,  4.24264069,  0.        ]])

As a complement, since the matrix is symmetric, you may want to remove the duplicates afterwards by taking the upper triangle:

>>> np.triu(out)
array([[ 0.        ,  2.23606798,  4.12310563],
       [ 0.        ,  0.        ,  4.24264069],
       [ 0.        ,  0.        ,  0.        ]])
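
Note that np.triu keeps the matrix shape and fills the lower triangle with zeros. If what you want is just the list of unique pairs and their distances, one possible approach (a sketch, not part of the original answer) is to index with np.triu_indices:

```python
import numpy as np

# Same example points as above, encoded as complex numbers in a single row.
z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])
out = abs(z.T - z)

# Row and column indices of the strict upper triangle (k=1 skips the diagonal).
i, j = np.triu_indices(out.shape[0], k=1)

# One distance per unordered pair (i[k], j[k]).
pair_dists = out[i, j]
```

Here i and j tell you which two points each entry of pair_dists belongs to.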

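Some benchmarks (each figure is the total time in seconds for timeit's default of 1,000,000 executions of the statement)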

>>> timeit.timeit('abs(z.T-z)', setup='import numpy as np;z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])')
4.645645342274779
>>> timeit.timeit('abs(z[..., np.newaxis] - z)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
5.049334864854522
>>> timeit.timeit('m, n = np.meshgrid(z, z); abs(m-n)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
22.489568296184686
Lives answered 28/3, 2014 at 19:27 Comment(2)
Did you ever find the distance? If so, you lost me. Where did that happen? - Vannesavanness
@WesModes, this is a rather late answer, but it might still be useful. A complex number is basically a two-dimensional point. The difference of two complex numbers is again a complex number, and its absolute value is its distance from (0, 0), which is exactly the distance between the two original points. - Predator
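
The explanation in the comment above can be checked numerically. This is a small sketch with made-up points, comparing the complex-number trick against the distance written out with real arithmetic:

```python
import numpy as np

# Hypothetical sample points, given as (x, y) rows.
xy = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 4.0]])

# Encode each point as x + iy.
z = xy[:, 0] + 1j * xy[:, 1]

# |z_i - z_j| equals sqrt(dx**2 + dy**2), the Euclidean distance.
via_complex = np.abs(z[:, np.newaxis] - z[np.newaxis, :])

# The same quantity spelled out with real coordinates, for comparison.
dx = xy[:, 0][:, np.newaxis] - xy[:, 0][np.newaxis, :]
dy = xy[:, 1][:, np.newaxis] - xy[:, 1][np.newaxis, :]
via_hypot = np.hypot(dx, dy)

assert np.allclose(via_complex, via_hypot)
```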

If you don't need the full distance matrix, you will be better off using a kd-tree. Consider scipy.spatial.cKDTree or sklearn.neighbors.KDTree. This is because a kd-tree can find k-nearest neighbors in O(n log n) time, and therefore you avoid the O(n**2) complexity of computing all n by n distances.
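
As a rough illustration of the kd-tree route, here is a sketch using scipy.spatial.cKDTree. The random points and the choice of three neighbors are assumptions made for the example only.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical data: 100 random points in the unit square.
rng = np.random.default_rng(0)
points = rng.random((100, 2))

tree = cKDTree(points)

# Ask for k=4 because each point's closest match is itself (distance 0).
dists, idx = tree.query(points, k=4)

# Drop the self-match to keep the 3 nearest other points.
neighbor_dists = dists[:, 1:]
neighbor_idx = idx[:, 1:]
```

With only about 100 points the full matrix is cheap anyway, so whether the tree pays off is something to measure on your own data.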

Cyclades answered 29/3, 2014 at 20:37 Comment(0)

Jake Vanderplas gives this example using broadcasting in Python Data Science Handbook, which is very similar to what @shx2 proposed.

import numpy as np
rand = np.random.RandomState(42)
X = rand.rand(3, 2)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

dist_sq  # squared distances: symmetric, zeros on the diagonal; values depend on X
array([[0.        , 0.18543317, 0.81602495],
       [0.18543317, 0.        , 0.22819282],
       [0.81602495, 0.22819282, 0.        ]])
Splendiferous answered 6/1, 2019 at 0:28 Comment(6)
scipy.spatial.distance.cdist is faster than this, 9 times faster in my test. - Butcherbird
@Butcherbird - you should write an answer with a call to %timeit, perhaps for a small (10x10) and large (1,000,000 x 1,000,000) distance matrix. That would be really useful information for people! - Splendiferous
I cannot use %timeit in my Jupyter notebook because I use the online variant and it runs out of memory for arrays that big. - Butcherbird
This is a super fast solution. - Hideous
This solution is a great example of broadcasting, but it consumes Θ(n^2 * d) memory (where n is the number of vectors and d is the dimension), whereas an optimal solution would only consume O(n^2). (Confirmed by /usr/bin/time -v.) - Anjanetteanjela
How would you compute the Manhattan distance? - Hideous
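
On the Manhattan distance question in the last comment: one way to sketch it is to keep the same broadcasting pattern and replace "square, sum" with "absolute value, sum". SciPy's cdist also supports this through its "cityblock" metric. The sample points below are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical sample data: three 2-D points.
X = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 4.0]])

# Broadcasting version: |dx| + |dy| for every pair of points.
manhattan = np.abs(X[:, np.newaxis, :] - X[np.newaxis, :, :]).sum(axis=-1)

# SciPy version; "cityblock" is SciPy's name for the Manhattan metric.
manhattan_scipy = cdist(X, X, metric="cityblock")

assert np.allclose(manhattan, manhattan_scipy)
```

The broadcasting version still builds the n x n x d intermediate array mentioned in the memory comment, while cdist does not.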

Here is how you can do it using numpy:

import numpy as np

x = np.array([0,1,2])
y = np.array([2,4,6])

# take advantage of broadcasting, to make a 2dim array of diffs
dx = x[..., np.newaxis] - x[np.newaxis, ...]
dy = y[..., np.newaxis] - y[np.newaxis, ...]
dx
=> array([[ 0, -1, -2],
          [ 1,  0, -1],
          [ 2,  1,  0]])

# stack in one array, to speed up calculations
d = np.array([dx,dy])
d.shape
=> (2, 3, 3)

Now all that is left is computing the L2-norm along the 0-axis (as discussed here):

(d**2).sum(axis=0)**0.5
=> array([[ 0.        ,  2.23606798,  4.47213595],
          [ 2.23606798,  0.        ,  2.23606798],
          [ 4.47213595,  2.23606798,  0.        ]])
Polymerize answered 28/3, 2014 at 19:21 Comment(1)
This actually takes quite a lot of memory if you have large x or y, and it is also slow. SciPy's distance matrix should be considerably faster. - Checkmate
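
If memory is the concern and you want to stay in pure NumPy, one alternative (not proposed in this thread, shown here only as a sketch) is to expand |a - b|^2 = |a|^2 + |b|^2 - 2 a·b, which needs only n x n arrays instead of a stacked difference array. The function name pairwise_dist_gram is hypothetical.

```python
import numpy as np


def pairwise_dist_gram(X):
    """Euclidean distance matrix via |a|^2 + |b|^2 - 2 a.b (hypothetical helper)."""
    # Squared norm of each row, shape (n,).
    sq_norms = np.einsum("ij,ij->i", X, X)
    # Only n x n arrays are created here, never an n x n x d one.
    dist_sq = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * (X @ X.T)
    # Rounding can leave tiny negative values; clamp before the square root.
    np.maximum(dist_sq, 0.0, out=dist_sq)
    return np.sqrt(dist_sq)


# Same x and y as in the answer above, combined into an (n, 2) float array.
x = np.array([0, 1, 2])
y = np.array([2, 4, 6])
X = np.column_stack([x, y]).astype(float)
D = pairwise_dist_gram(X)
```

The trade-off is numerical: for points that are very close together relative to their norms, the subtraction can lose precision, so the direct difference is safer when accuracy matters more than memory.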

If you are looking for the most efficient way to compute this, use SciPy's cdist() (or pdist() if you need just the vector of pairwise distances instead of the full distance matrix), as suggested in Tweakimp's comment. As he said, it is a lot faster than the methods based on vectorization and broadcasting proposed by RichPauloo and shx2. The reason is that SciPy's cdist() and pdist() use a for loop and C implementations of the metric computations under the hood, which are even faster than vectorization.

By the way, if you can use SciPy and still prefer a method based on broadcasting, you don't have to implement it yourself: the distance_matrix() function is a pure Python implementation that leverages broadcasting and vectorization (source code, docs).

It's worth mentioning that cdist()/pdist() is also more memory-efficient than broadcasting, as it computes distances one by one and avoids creating arrays of n*n*d elements, where n is the number of points and d is the points' dimensionality.
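
To make the three options concrete, here is a small sketch on made-up data that calls each of them and checks that they agree. The array sizes are arbitrary and chosen only to show the output shapes.

```python
import numpy as np
from scipy.spatial import distance, distance_matrix

# Hypothetical data: two small sets of 2-D points of different sizes.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 2))
b = rng.normal(size=(7, 2))

# Full 5 x 7 matrix of distances between every row of a and every row of b.
via_cdist = distance.cdist(a, b)
via_distance_matrix = distance_matrix(a, b)
via_broadcast = np.sqrt(
    ((a[:, np.newaxis, :] - b[np.newaxis, :, :]) ** 2).sum(axis=-1)
)

# pdist works on a single set and returns the 5 * 4 / 2 = 10 unique distances.
condensed = distance.pdist(a)

assert np.allclose(via_cdist, via_distance_matrix)
assert np.allclose(via_cdist, via_broadcast)
```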

Experiments

I ran some simple experiments to compare the performance of SciPy's cdist(), distance_matrix() and a broadcasting implementation in NumPy. I used perf_counter_ns() from Python's time module to measure time; all results are averaged over 10 runs on 10000 points in 2D space using the np.float64 datatype (tested on Python 3.8.10, Windows 10 with Ryzen 2700 and 16 GB RAM):

  • cdist() - 0.6724s
  • distance_matrix() - 3.0128s
  • my NumPy implementation - 3.6931s

Code if someone wants to reproduce experiments:

from scipy.spatial import distance, distance_matrix
import numpy as np
from time import perf_counter_ns


def dist_mat_custom(a, b):
    return np.sqrt(np.sum(np.square(a[:, np.newaxis, :] - b[np.newaxis, :, :]), axis=-1))


results = []
size = 10000
it_num = 10
for i in range(it_num):
    a = np.random.normal(size=(size, 2))
    b = np.random.normal(size=(size, 2))
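    start = perf_counter_ns()
    # time one method per run of the script; swap in an alternative below to compare
    c = distance_matrix(a, b)
    # c = dist_mat_custom(a, b)
    # c = distance.cdist(a, b)
    results.append(perf_counter_ns() - start)
print(np.mean(results) / 1e9)  # mean time per run, in seconds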
Jansenism answered 11/6, 2022 at 20:31 Comment(0)

If your vectors are normalized to unit length, you can often use cosine distance instead, which is a single matrix product and can be calculated much faster (potentially by orders of magnitude):

dist_matrix = 1 - np.matmul(vectors, vectors.T)

Note that this is different from the Euclidean distance, but for unit vectors ||a - b||^2 = 2 * (1 - a·b), so it gives the same result when comparing distances.

It could be useful for huge distance matrices.
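
A quick way to convince yourself of the relationship between the two is the following sketch, which uses random unit vectors made up for the example and checks that the squared Euclidean distance is twice the cosine distance:

```python
import numpy as np

# Hypothetical data: six random 3-D vectors, normalized to unit length.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 3))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine distance via a single matrix product.
cos_dist = 1.0 - vectors @ vectors.T

# Squared Euclidean distance via broadcasting, for comparison.
diff = vectors[:, np.newaxis, :] - vectors[np.newaxis, :, :]
eucl_sq = (diff ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 == 2 * (1 - a.b).
assert np.allclose(eucl_sq, 2.0 * cos_dist)

# The true Euclidean distance can be recovered if needed
# (clip guards against tiny negative values from rounding).
euclid = np.sqrt(np.clip(2.0 * cos_dist, 0.0, None))
```

This only holds when every vector has length 1, so it does not apply directly to raw coordinates like the cell positions in the question.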

Walton answered 25/4, 2024 at 17:32 Comment(0)
