Compute the pairwise distance in scipy with missing values
I'm a bit stumped by how scipy.spatial.distance.pdist handles missing (nan) values.

So just in case I messed up the dimensions of my matrix, let's get that out of the way. From the docs:

The points are arranged as m n-dimensional row vectors in the matrix X.

So let's generate three points in 10 dimensional space with missing values:

import numpy
from scipy.spatial.distance import pdist

numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan

If I compute the Euclidean distance of these three observations:

pdist(data, "euclidean")

I get:

array([ nan,  nan,  nan])

However, if I filter out the columns that contain missing values, I do get proper distance values:

valid = [i for (i, col) in enumerate(data.T) if ~numpy.isnan(col).any()]
pdist(data[:, valid], "euclidean")

I get:

array([ 3.35518662,  2.35481185,  3.10323893])

This way, I throw away more data than I would like, since I only need to filter each pair of vectors being compared, not the whole matrix. Can I make pdist, or a similar function, perform pairwise masking somehow?
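One workaround worth noting: pdist accepts a Python callable as the metric, so the per-pair masking can be expressed directly. The function name masked_euclidean below is my own choice, and a callable metric incurs a Python call per pair, so this will not be fast:

```python
import numpy
from scipy.spatial.distance import pdist

def masked_euclidean(u, v):
    # Keep only the dimensions where both vectors are finite.
    valid = numpy.isfinite(u) & numpy.isfinite(v)
    return numpy.linalg.norm(u[valid] - v[valid])

numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan

# pdist calls the metric on each pair of rows.
d = pdist(data, masked_euclidean)
```

Each entry of `d` is now computed over the coordinates valid in both rows of that pair, so no column is discarded globally.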


Edit:

Since my full matrix is rather large, I did some timing tests on the small data set provided here.

1.) The scipy function.

%timeit pdist(data, "euclidean")
10000 loops, best of 3: 24.4 µs per loop

2.) Unfortunately, the solution provided so far is roughly 10 times slower.

%timeit numpy.array([pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()
1000 loops, best of 3: 231 µs per loop

3.) Then I did a test of "pure" Python and was pleasantly surprised:

from scipy.linalg import norm

%%timeit
m = data.shape[0]
dm = numpy.zeros(m * (m - 1) // 2, dtype=float)
mask = numpy.isfinite(data)
k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        curr = numpy.logical_and(mask[i], mask[j])
        u = data[i][curr]
        v = data[j][curr]
        dm[k] = norm(u - v)
        k += 1
10000 loops, best of 3: 98.9 µs per loop

So I think the way forward is to Cythonize the above loop as a function.
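Before reaching for Cython, a fully vectorized NumPy sketch of the same pairwise masking may be worth trying: nan propagates through subtraction, and numpy.nansum then skips exactly the dimensions where either vector is missing. Note it builds an (m, m, n) intermediate, so memory grows quickly with m:

```python
import numpy

numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan

# Broadcast to all pairwise differences: shape (m, m, n).
diff = data[:, None, :] - data[None, :, :]
# nan - anything is nan, and nansum treats nan as 0, so each
# pair's sum runs over its mutually valid dimensions only.
sq = numpy.nansum(diff ** 2, axis=-1)
# Extract the upper triangle in pdist's condensed ordering.
i, j = numpy.triu_indices(data.shape[0], k=1)
dm = numpy.sqrt(sq[i, j])
```

For small m this trades memory for speed; whether it beats the loop on 1000 points in 100 dimensions would need its own timing run.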

Carpathoukraine answered 16/7, 2014 at 13:0 Comment(6)
If you Cythonize that, wouldn't it look nicer built around masked arrays directly? Also, be careful about extrapolating the performance of small samples to larger samples...Lubumbashi
Sorry for my worst-grammared comment so far. Running on 1000 points in 100-dimensional space, the pure-Python solution is still faster (2.65x).Lubumbashi
@Lubumbashi What do you mean by "building it around masked arrays directly"? You still need to use the mask supplied to an ma.array if you want to hide data, right? Maybe you could edit your answer or we have a little chat.Carpathoukraine
Exactly so, but the python-code will look neater. I dug around github.com/scipy/scipy/blob/v0.13.0/scipy/spatial/src/… too, don't know if that would be of assistance. Could chat, but don't know if I have much more to offer.Lubumbashi
@Midnighter, did you get any further speed-up on your code?Pole
I think another way to speed up this comparison is to use multiprocessing module and let different cores work on different chunk of data.Pole

Actually you might be better off with this ready made solution: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html

However, the drawback seems to be that it is trickier to apply weights when there are missing values.
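A usage sketch of that function. One caveat worth flagging: per its documentation, nan_euclidean_distances does not simply drop the missing coordinates; it rescales each pair's distance by the ratio of total to present coordinates, so its values differ from the plain masked norm used in the question's loop. squareform converts the full matrix into pdist's condensed form:

```python
import numpy
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import nan_euclidean_distances

numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan

# Full (m, m) symmetric distance matrix, nan-aware.
full = nan_euclidean_distances(data)
# Condensed vector in the same ordering pdist would produce.
condensed = squareform(full, checks=False)
```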

Digitate answered 1/8, 2020 at 12:9 Comment(1)
That would have been a great solution indeed but it was only added in December 2019 while the question is a few years old by now. I will accept your answer because it is the function I'd use today.Carpathoukraine

If I understand you correctly, you want the distance over all dimensions that two vectors both have valid values for.

Unfortunately, pdist doesn't understand masked arrays in that sense, so I modified your semi-solution so that it does not reduce information. It is, however, neither the most efficient nor the most readable solution:

numpy.array([pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()

The outer numpy.array(...) call and .ravel() just reshape the result to match what pdist would return.

itertools.combinations produces all pairwise possible indices of the data-array.

I then slice data with these index pairs (each must be a list, not a tuple, to select rows correctly) and do the pairwise nan filtering just as your code did.
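The list-versus-tuple point can be seen on a toy array: a list of indices selects rows, whereas a tuple is interpreted as one index per axis:

```python
import itertools
import numpy

data = numpy.arange(12).reshape(4, 3)
pairs = list(itertools.combinations(range(data.shape[0]), 2))

s = list(pairs[0])   # [0, 1]: fancy indexing, selects rows 0 and 1
rows = data[s]       # shape (2, 3)

scalar = data[(0, 1)]  # tuple: multi-axis index, same as data[0, 1]
```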

Lubumbashi answered 16/7, 2014 at 15:23 Comment(0)
