distance matrix of curves in python

I have a set of curves defined as 2D arrays of shape (number of points, number of coordinates). I am calculating a distance matrix for them using the Hausdorff distance. My current code is below. Unfortunately it is too slow with 500-600 curves, each having 50-100 3D points. Is there a faster way to do this?

import numpy as np
import scipy.spatial.distance

def distanceBetweenCurves(C1, C2):
    # all pairwise point-to-point distances between the two curves
    D = scipy.spatial.distance.cdist(C1, C2, 'euclidean')

    # non-symmetric (directed) Hausdorff distances
    H1 = np.max(np.min(D, axis=1))
    H2 = np.max(np.min(D, axis=0))

    return (H1 + H2) / 2.

def distanceMatrixOfCurves(Curves):
    numC = len(Curves)

    D = np.zeros((numC, numC))
    for i in range(0, numC - 1):
        for j in range(i + 1, numC):
            # symmetric measure, so fill both triangles from one computation
            D[i, j] = D[j, i] = distanceBetweenCurves(Curves[i], Curves[j])

    return D
Erepsin answered 3/12, 2012 at 22:20 Comment(8)
Is scipy.spatial.distance.cdist the slow piece, or the double loop inside distanceMatrixOfCurves? If these curves are convex it might be possible to optimize the first potential bottleneck. Do these curves intersect, or are some contained inside others? I feel like you could reuse earlier computed distances to speed up new calculations. This is just babbling of course; I have a similar issue myself with min(min(..)) measures and had trouble generalizing the considerations I'm putting here. Did you try or think about anything beyond the code you posted?Refuge
I tried implementing the Euclidean distance myself (instead of using cdist); nothing much changed. I think the problem is the double loop. Curves (some of them) intersect and are contained inside others...Erepsin
@Erepsin you should profile your code (cProfile, plus runsnakerun to interpret the results, are great) to be sure what the exact bottleneck is. I don't have a great feel for these things, but you may be able to get away without allocating the large all-pairs matrix of distances that cdist computes - if you can add some code that generates some small example data it will be easier to help you.Groyne
@MrE while it is certainly possible to get rid of cdist (just use a form of wave propagation), that might cost more memory, and in the worst case it doesn't help. This doesn't mean it can't help in reducing execution time, but it is problematic.Refuge
Sure, I don't think there's an easy solution. My only other suggestion would be to store the points of each curve in a spatial data structure like a kd-tree, which would at least accelerate the search for the nearest neighbour to a given point, but may well end up being slower overall (a sketch of this idea follows these comments). Interested to see what people suggest here.Groyne
1. Do you really need the full matrix D, or can you live with just the upper- or lower-triangular part? This D[i,j] = D[j,i] = ... is definitely not good for data locality; 2. Have you tried using list comprehensions or map instead of the double loop?Baca
This doesn't reduce the computational complexity in any way. As it stands, I think the OP will just get minor implementation improvements, i.e. "use library/language/package X because it runs Y times faster!" (where X is the same method, maybe with some minor touches), etc. I hope I'm proved wrong here.Refuge
@Zhenya I do not need the whole matrix, you're right. Do you think list comprehensions are faster than loops? I can try this...Erepsin
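
Following the kd-tree suggestion in the comments, here is a minimal sketch of that idea (the function names are made up for illustration). It assumes scipy.spatial.cKDTree is available and keeps the same averaged directed-Hausdorff measure as the question; whether it actually beats cdist depends on how many points each curve has.

import numpy as np
from scipy.spatial import cKDTree

def distanceBetweenCurvesKD(C1, C2, tree1, tree2):
    # nearest-neighbour queries against kd-trees instead of a full cdist matrix
    H1 = tree2.query(C1)[0].max()   # directed Hausdorff distance C1 -> C2
    H2 = tree1.query(C2)[0].max()   # directed Hausdorff distance C2 -> C1
    return (H1 + H2) / 2.

def distanceMatrixOfCurvesKD(Curves):
    trees = [cKDTree(C) for C in Curves]   # build each tree once, reuse for all pairs
    numC = len(Curves)
    D = np.zeros((numC, numC))
    for i in range(numC - 1):
        for j in range(i + 1, numC):
            D[i, j] = D[j, i] = distanceBetweenCurvesKD(
                Curves[i], Curves[j], trees[i], trees[j])
    return D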

Your question might also be related to this one

This is kind of a hard problem. A possible way would be to implement the Euclidean distance on your own, completely abandon scipy, and make use of pypy's JIT compiler. But most likely this will not gain you much.

Personally, I would recommend writing the routine in C.
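
One way to get the inner loops into compiled code without leaving Python, assuming a reasonably recent SciPy (0.19 or newer), is its directed_hausdorff routine; a minimal sketch of how it could stand in for distanceBetweenCurves from the question:

from scipy.spatial.distance import directed_hausdorff

def distanceBetweenCurvesFast(C1, C2):
    # directed_hausdorff returns (distance, index_in_u, index_in_v);
    # this keeps the same averaged measure used in the question
    H1 = directed_hausdorff(C1, C2)[0]
    H2 = directed_hausdorff(C2, C1)[0]
    return (H1 + H2) / 2.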

The problem is less the implementation than the way you approach it. You chose a brute-force approach, calculating the Euclidean distance for each distinct pair of points in each possible pair of the metric-space subsets (curves). This is computationally demanding:

  • Assume you have 500 curves and each of them has 75 points. With the brute-force approach you end up calculating the Euclidean distance 500 * 499 * 75 * 75 = 1 403 437 500 times (half that, about 700 million, if you exploit the symmetry as your loop already does). It is not surprising that this approach takes forever to run.

I'm not an expert on this, but I know that the Hausdorff distance is extensively used in image processing. I would suggest you browse the literature for speed-optimized algorithms. A starting point might be this, or this paper. Also, often mentioned in combination with the Hausdorff distance is the Voronoi diagram.

I hope these links might help you with this problem.

Epode answered 4/12, 2012 at 10:5 Comment(5)
The first linked paper is interesting and solves part of the problem. The other problem can be more or less formulated like this: given three sets A, B, C, if we know the Hausdorff distance between A and B and also the inner details obtained from that calculation, is it possible to find the Hausdorff distance from A to C with less computational effort than it took for A and B, provided there are paths, under some metric, from A to C that cross B? Well, that was way longer than I expected when I started to write. I hope someone can make sense of it.Refuge
thank you for all your advice. I got the idea: "use a smarter algorithm". I hope I'll do it after using the current version for the paper deadline :)Erepsin
@Erepsin I'm sorry not to be able to help you more than that, but there is no easy solution for this problem. One idea I would have for speeding up the computation, though: instead of computing, for a given point in Curve1, the Euclidean distance to all the points of Curve2, you could choose a point from Curve2 at random and compute the Euclidean distance to it (call it r). Then, for every other point of Curve2, first compute the per-coordinate absolute differences (the terms of the Manhattan distance, much cheaper) and check whether any of them is bigger than r. If so...Epode
@Erepsin the current candidate (the one that gave r) is closer and you do not need to compute the Euclidean distance (each coordinate difference is a lower bound on it). If not, compute the Euclidean distance and, if it is smaller than r, update r. You will still be running through all the points in Curve2, but you will often only have to do the cheap coordinate check, which is faster. ...well, this is just an idea and there are certainly more sophisticated methods to speed up this process (a sketch of this idea follows these comments).Epode
@Erepsin consider accepting this answer if you are satisfied with it. :)Epode
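
A minimal sketch of the early-exit idea from the last two comments (function names are made up for illustration): a candidate point is skipped as soon as a single coordinate difference already exceeds the best distance found so far.

import numpy as np

def nearest_distance_with_early_exit(p, curve):
    # distance from point p to its nearest neighbour in curve, skipping
    # candidates that the cheap per-coordinate check already rules out
    r = np.linalg.norm(p - curve[0])          # start from an arbitrary candidate
    for q in curve[1:]:
        diff = np.abs(p - q)
        if np.any(diff > r):                  # one coordinate alone exceeds r,
            continue                          # so the Euclidean distance must too
        d = np.sqrt(np.sum(diff * diff))
        if d < r:
            r = d
    return r

def directedHausdorffEarlyExit(C1, C2):
    # directed Hausdorff distance from C1 to C2 using the early exit above
    return max(nearest_distance_with_early_exit(p, C2) for p in C1)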

I recently replied to a similar question here: Hausdorff distance between 3D grids

I hope this helps. I was faced with 25 x 25,000 points in a pairwise comparison (25 x 25 x 25,000 points in total), and my code runs from 1 minute up to 3-4 hours (depending on the number of points). I don't see many options mathematically for gaining speed.

Alternatives are to use a different programming language (C/C++) or to bring this calculation to the GPU (CUDA). I am playing with the CUDA approach right now.

Edit on 03/12/2015:

I was able to speed up this comparison by doing parallel CPU-based computation. That was the quickest way to go. I used the nice example from the pp package (parallel python) and ran it on three different computer/Python combinations. Unfortunately I had memory errors all the time with 32-bit Python 2.7, so I installed WinPython 2.7 64-bit and some experimental 64-bit numpy packages.
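
A minimal sketch of how that pairwise loop can be spread over CPU cores; it uses the standard-library multiprocessing module as a stand-in for pp, and assumes a module-level distanceBetweenCurves like the one in the question:

import itertools
import numpy as np
from multiprocessing import Pool

def _pair_distance(args):
    # helper run in a worker process: one (i, j) pair of curves
    i, j, Ci, Cj = args
    return i, j, distanceBetweenCurves(Ci, Cj)

def distanceMatrixParallel(Curves, processes=4):
    numC = len(Curves)
    tasks = [(i, j, Curves[i], Curves[j])
             for i, j in itertools.combinations(range(numC), 2)]
    D = np.zeros((numC, numC))
    # on Windows, call this from inside an `if __name__ == "__main__":` block
    with Pool(processes) as pool:
        for i, j, d in pool.map(_pair_distance, tasks):
            D[i, j] = D[j, i] = d
    return D

Each task carries its two curves with it, so for very large inputs a shared read-only container would be preferable to copying the data into every task.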


So for me this effort was quite helpful, and it was not as complicated as the CUDA route.... Good luck

Trappist answered 26/11, 2015 at 10:13 Comment(0)

There are several methods you can try:

  1. Using numpy-MKL, a numpy build linked against Intel's high-performance Math Kernel Library, instead of a plain numpy build;
  2. Using Bottleneck for its fast array functions;
  3. Using Cython to compile the computation (a compiled-loop sketch follows this list).
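
A minimal sketch of the compiled-loop idea from point 3; it uses Numba's JIT instead of Cython purely for brevity (Numba is an assumption, not part of the original answer), but the loop structure would translate to Cython almost line for line:

import numpy as np
from numba import njit

@njit
def directed_hausdorff_sq(A, B):
    # max over points a in A of (min over points b in B of ||a - b||^2),
    # written as explicit loops so the JIT can compile them to machine code
    worst = 0.0
    for i in range(A.shape[0]):
        best = np.inf
        for j in range(B.shape[0]):
            d = 0.0
            for k in range(A.shape[1]):
                diff = A[i, k] - B[j, k]
                d += diff * diff
            if d < best:
                best = d
        if best > worst:
            worst = best
    return worst

def distanceBetweenCurvesCompiled(C1, C2):
    # same averaged measure as in the question, built from the compiled kernel
    return (np.sqrt(directed_hausdorff_sq(C1, C2))
            + np.sqrt(directed_hausdorff_sq(C2, C1))) / 2.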
Planet answered 4/12, 2012 at 1:53 Comment(0)
