Efficient distance calculation between N points and a reference in numpy/scipy
A

8

28

I just started using scipy/numpy. I have a 100000*3 array in which each row is a coordinate, and a 1*3 center point. I want to calculate the distance from each row of the array to the center and store the results in another array. What is the most efficient way to do it?

Assail answered 21/6, 2011 at 18:21 Comment(2)
Possible duplicate of "calculate euclidean distance with numpy" - Foursome
@larsmans: I don't think it's a duplicate, since the answers there only pertain to the distance between two points rather than the distance between N points and a reference point. And certainly the responses don't point the OP to the efficient scipy solution that I show below. - Uncompromising
U
38

I would take a look at scipy.spatial.distance.cdist:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

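import numpy as np
from scipy.spatial.distance import cdist  # import the submodule function explicitly

a = np.random.normal(size=(10, 3))  # 10 points in three dimensions
b = np.random.normal(size=(1, 3))   # the single reference point (2-D, as cdist requires)

dist = cdist(a, b)  # shape (10, 1); pass metric=... to pick another distance metric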

For the default distance metric (Euclidean), dist is equivalent to:

np.sqrt(np.sum((a-b)**2,axis=1))  # shape (10,); cdist returns shape (10, 1)

although cdist is much more efficient for large arrays (on my machine, for your problem size, cdist is faster by a factor of ~35).
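As a quick check of the equivalence described above, the following sketch compares the two approaches at the question's problem size. The names `points` and `center` and the fixed seed are illustrative assumptions, not part of the original answer. Since cdist returns a column of shape (N, 1), it is flattened before comparing.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)          # fixed seed, only for reproducibility
points = rng.normal(size=(100000, 3))   # N points, one per row
center = rng.normal(size=(1, 3))        # single reference point, kept 2-D for cdist

d_cdist = cdist(points, center).ravel()                     # (N, 1) -> (N,)
d_numpy = np.sqrt(np.sum((points - center) ** 2, axis=1))   # plain broadcasting

print(d_cdist.shape, np.allclose(d_cdist, d_numpy))
```

Which of the two is faster seems to depend on the machine and library versions, so it is worth timing both on your own data.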

Uncompromising answered 21/6, 2011 at 18:24 Comment(3)
In this answer, where is the single reference point? - Aeromarine
b is the single reference point in three dimensions; a is 10 other points in three dimensions. - Fecund
In case b has more than one point (all pairs): np.sqrt(np.sum((hs[:, None] - an)**2, axis=2)) - Credible
D
7

I would use the sklearn implementation of the Euclidean distance. Its advantage is that it uses a more efficient expression based on matrix multiplication:

dist(x, y) = sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))

A simple script would look like this:

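import numpy as np

x = np.random.rand(1000, 3)
y = np.random.rand(1000, 3)

# Row-wise squared norms, shaped so that they broadcast to (1000, 1000)
xx = np.einsum('ij,ij->i', x, x)[:, None]
yy = np.einsum('ij,ij->i', y, y)[None, :]

# ||x||^2 - 2 <x, y> + ||y||^2 for every pair; clip round-off negatives before sqrt
dist = np.sqrt(np.maximum(xx - 2 * np.dot(x, y.T) + yy, 0))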

The advantage of this approach has been nicely described in the sklearn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances

I am using this approach to crunch large data matrices (10000, 10000), with some minor modifications such as using the np.einsum function.
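For the question's case of N points against one center, the same expansion reduces to a single matrix-vector product. The sketch below is only an illustration of that idea; the names `points` and `center` are assumptions, and the clipping to zero guards against small negative values caused by round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100000, 3))   # N points, one per row
center = rng.random(3)             # one reference point, shape (3,)

# ||p - c||^2 = ||p||^2 - 2 <p, c> + ||c||^2, evaluated for every row p
sq = np.einsum('ij,ij->i', points, points) - 2.0 * (points @ center) + center @ center
dist = np.sqrt(np.maximum(sq, 0.0))   # clip round-off negatives before sqrt

# Reference result from direct subtraction
direct = np.linalg.norm(points - center, axis=1)
print(np.allclose(dist, direct))
```

The expanded form can lose precision when a point lies very close to the center, so the direct subtraction is the safer choice when accuracy matters more than speed.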

Discovery answered 21/7, 2014 at 17:4 Comment(2)
Doesn't address the question of calculating against a single reference point. - Hausmann
numpy.sqrt((X**2).sum(axis=1)[:, None] - 2 * X.dot(Y.transpose()) + ((Y**2).sum(axis=1)[None, :])) - Bort
B
1

You can also use the expansion of the norm (similar to the binomial identities). This is probably the most efficient way to compute the distances between two sets of points stored as matrices.

Here is a code snippet that I originally used for a k-Nearest-Neighbors implementation, in Octave, but you can easily adapt it to numpy since it only uses matrix multiplications (the equivalent is numpy.dot()):

% Compute the squared Euclidean distance between each known point (Xapp) and each unknown point (Xtest)
% Note: we expand the norm just like a binomial identity:
% ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2*<x1,x2>
[napp, d] = size(Xapp);
[ntest, d] = size(Xtest);

A = sum(Xapp.^2, 2);
A = repmat(A, 1, ntest);

B = sum(Xtest.^2, 2);
B = repmat(B', napp, 1);

C = Xapp*Xtest';

dist = A+B-2.*C;   % squared distances; use sqrt(dist) if the actual distances are needed
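A possible NumPy port of the Octave snippet above is sketched here. The variable names are carried over from the Octave code, the random test data is an assumption, and broadcasting takes the place of repmat. The final clipping step is an addition that avoids taking the square root of slightly negative round-off values.

```python
import numpy as np

rng = np.random.default_rng(0)
Xapp = rng.random((5, 3))    # known points, shape (napp, d)
Xtest = rng.random((4, 3))   # unknown points, shape (ntest, d)

A = np.sum(Xapp ** 2, axis=1)[:, np.newaxis]    # (napp, 1), broadcasts like repmat
B = np.sum(Xtest ** 2, axis=1)[np.newaxis, :]   # (1, ntest)
C = Xapp @ Xtest.T                              # (napp, ntest) inner products

sq_dist = A + B - 2.0 * C                       # squared Euclidean distances
dist = np.sqrt(np.maximum(sq_dist, 0.0))        # clip round-off negatives, then sqrt

print(dist.shape)
```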
Brethren answered 5/4, 2013 at 21:11 Comment(0)
M
1

This might not answer your question directly, but if you are after the distances between all unique pairs of particles, I've found the following solution to be faster than the pdist function in some cases.

import numpy as np

L   = 100       # simulation box dimension
N   = 100       # Number of particles
dim = 2         # Dimensions

# Generate random positions of particles
r = (np.random.random(size=(N,dim))-0.5)*L

# uti is a tuple of two 1-D numpy arrays
# containing the row and column indices of the upper triangle
uti = np.triu_indices(N, k=1)         # k=1 excludes the diagonal indices

# uti[0] holds the i indices and uti[1] the j indices of each pair (i < j)
dr = r[uti[0]] - r[uti[1]]            # computes differences between particle positions
D = np.sqrt(np.sum(dr*dr, axis=1))    # computes distances; D has shape (N*(N-1)/2,) = (4950,)

See my blog post for a more in-depth look at this matter.
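To confirm that the upper-triangle approach yields the same values, in the same order, as scipy's pdist, a small check along these lines may help. The seed is an arbitrary assumption, and no timing claim is made here.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
N, dim, L = 100, 2, 100
r = (rng.random((N, dim)) - 0.5) * L     # random particle positions in the box

i, j = np.triu_indices(N, k=1)           # all index pairs with i < j
dr = r[i] - r[j]
D = np.sqrt(np.sum(dr * dr, axis=1))     # one distance per unique pair

D_ref = pdist(r)                         # condensed distances, same pair ordering
print(D.shape, np.allclose(D, D_ref))
```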

Merkel answered 23/4, 2017 at 11:40 Comment(0)
U
1

SciPy's cdist is nice, but you can do the same in raw NumPy with np.sqrt(np.sum((points-center)**2, axis=1)). At least on my laptop, IPython's %timeit gives:

no. of points   100       1000      10000     100000    1000000
raw NumPy       7.82 µs   24.5 µs   187 µs    2.03 ms   23.6 ms
cdist           4.87 µs   27.1 µs   247 µs    2.45 ms   24.4 ms

In [17]: import numpy as np
    ...: from scipy.spatial.distance import cdist
    ...: 
    ...: center = np.random.rand(1,3)
    ...: points = np.random.rand(1000000,3)
    ...: 
    ...: for i in (100, 1000, 10000, 100000, 1000000-1):
    ...:     %timeit np.sqrt(np.sum((points[:i]-center)**2, axis=1))
    ...: 
    ...: for i in (100, 1000, 10000, 100000, 1000000-1):
    ...:     %timeit cdist(points[:i], center)
7.82 µs ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
24.5 µs ± 79 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
187 µs ± 841 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
2.03 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.87 µs ± 22.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
27.1 µs ± 3.56 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
247 µs ± 69 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.45 ms ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
24.4 ms ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: np.sqrt(np.sum((points[:i]-center)**2, axis=1))
Out[18]: 
array([0.89425032, 0.22533743, 0.54540411, ..., 0.36515514, 0.31757802,
       0.59976782])

In [19]: cdist(points[:i], center)
Out[19]: 
array([[0.89425032],
       [0.22533743],
       [0.54540411],
       ...,
       [0.36515514],
       [0.31757802],
       [0.59976782]])

Unshaped answered 22/2 at 13:20 Comment(0)
S
0

You may need to specify in more detail which distance function you are interested in, but here is a very simple (and efficient) implementation of the squared Euclidean distance based on an inner product (which can be generalized in a straightforward manner to other kinds of distance measures):

In []: from numpy import dot, ones; from numpy.random import randn
In []: P, c = randn(5, 3), randn(1, 3)
In []: dot((P - c)**2, ones(3))
Out[]: array([  8.80512,   4.61693,   2.6002,   3.3293,  12.41800])

Where P are your points and c is the center.
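As a hedged variation on the snippet above, the row-wise sum of squares can also be written with np.einsum, which avoids building the ones vector. The names `P` and `c` follow the answer; the seed and the final square root (to get actual rather than squared distances) are additions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((5, 3))   # points, one per row
c = rng.standard_normal((1, 3))   # center

d = P - c                                   # broadcast subtraction, shape (5, 3)
sq_dot = np.dot(d ** 2, np.ones(3))         # squared distances, as in the answer
sq_einsum = np.einsum('ij,ij->i', d, d)     # same values via a row-wise inner product
dist = np.sqrt(sq_einsum)                   # actual Euclidean distances

print(np.allclose(sq_dot, sq_einsum))
```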

Simonize answered 21/6, 2011 at 21:22 Comment(2)
On my machine this is still 18x slower than cdist for the OP's problem size. - Uncompromising
@JoshAdel: That's a big difference. FWIW, with numpy 1.6 on my modest machine, for n = 1e5 the timings are 3.5 ms for cdist and 9.5 ms for dot, so dot is only about 3 times slower. However, with much smaller n (< 2e3), dot will be faster. Thanks - Simonize
A
0
# Is this a correct way to find the largest distance between the points on a plane?

from math import sqrt

n = int(input( "enter the range : "))
x = list(map(float,input("type x coordinates: ").split()))
y = list(map(float,input("type y coordinates: ").split()))
maxdis = 0  
for i in range(n):
    for j in range(n):
        print(i, j, x[i], x[j], y[i], y[j])
        dist = sqrt((x[j]-x[i])**2+(y[j]-y[i])**2)
        if maxdis < dist:
            maxdis = dist
print(" maximum distance is : {:5g}".format(maxdis))
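If the goal is the largest distance between any two of the points, a vectorized NumPy sketch might look like the following. The sample coordinates are made up for illustration and replace the interactive input of the script above.

```python
import numpy as np

# Hypothetical sample coordinates standing in for the typed-in values
x = np.array([0.0, 3.0, 0.0, 1.0])
y = np.array([0.0, 0.0, 4.0, 1.0])

pts = np.column_stack((x, y))                  # shape (n, 2)
diff = pts[:, None, :] - pts[None, :, :]       # all pairwise differences, (n, n, 2)
dists = np.sqrt(np.sum(diff ** 2, axis=2))     # full (n, n) distance matrix
maxdis = dists.max()

print(" maximum distance is : {:5g}".format(maxdis))
```

For these points the largest distance is between (3, 0) and (0, 4), that is, 5.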
Aiglet answered 16/11, 2018 at 7:2 Comment(1)
Please explain your solution. - Pride
I
0

Using scipy.spatial.distance.cdist is the best solution, but if you are not allowed to use anything other than numpy, you can define a function like this:

import numpy as np


def pairwise_distances(x, y):
    """
    Compute pair-wise distances between points in x and y.

    Parameters:
        x (ndarray): Numpy array of shape (n_samples_x, n_features).
        y (ndarray): Numpy array of shape (n_samples_y, n_features).

    Returns:
        ndarray: Numpy array of shape (n_samples_x, n_samples_y) containing
        the pair-wise distances between points in x and y.
    """
    # Reshape x and y to enable broadcasting
    x_reshaped = x[:, np.newaxis, :]  # Shape: (n_samples_x, 1, n_features)
    y_reshaped = y[np.newaxis, :, :]  # Shape: (1, n_samples_y, n_features)

    # Compute pair-wise distances using Euclidean distance formula
    pairwise_distances = np.sqrt(np.sum((x_reshaped - y_reshaped)**2, axis=2))

    return pairwise_distances


# Now, let's check the above function
x = np.random.rand(1000, 3)
y = np.random.rand(10, 3)
pair_dist = pairwise_distances(x, y)
print(pair_dist.shape)
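One thing to keep in mind with the broadcasting approach is that it builds an intermediate array of shape (n_samples_x, n_samples_y, n_features), which can become large. A minimal sketch of a chunked variant follows; the helper name `pairwise_distances_chunked` and the `chunk` parameter are hypothetical and chosen only for this illustration.

```python
import numpy as np


def pairwise_distances_chunked(x, y, chunk=1024):
    """Hypothetical helper: same result as full broadcasting, `chunk` rows of x at a time."""
    out = np.empty((x.shape[0], y.shape[0]))
    for start in range(0, x.shape[0], chunk):
        # Intermediate is at most (chunk, n_samples_y, n_features)
        block = x[start:start + chunk, np.newaxis, :] - y[np.newaxis, :, :]
        out[start:start + chunk] = np.sqrt(np.sum(block ** 2, axis=2))
    return out


rng = np.random.default_rng(0)
x = rng.random((1000, 3))
y = rng.random((10, 3))
pair_dist = pairwise_distances_chunked(x, y, chunk=256)
print(pair_dist.shape)
```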
Impala answered 21/2 at 8:11 Comment(0)
