Python Earth Mover Distance of 2D arrays
I would like to compute the Earth Mover Distance between two 2D arrays (these are not images).

Right now I am looking at two libraries: scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html) and pyemd (https://pypi.org/project/pyemd/).

import numpy as np
import pyemd

# define a sampling method
def sampling2D(n, mu1, std1, mu2, std2):
    # sample from N(0, 1) in the 2D hyperspace
    x = np.random.randn(n, 2)

    # scale N(0, 1) -> N(mu, std) in each dimension
    x[:, 0] = (x[:, 0] * std1) + mu1
    x[:, 1] = (x[:, 1] * std2) + mu2

    return x

# generate two sets
Y1 = sampling2D(1000, 0, 1, 0, 1)
Y2 = sampling2D(1000, -1, 1, -1, 1)

# compute the distance
distance = pyemd.emd_samples(Y1, Y2)

While the scipy version doesn't accept 2D arrays and returns an error, the pyemd method returns a value. However, the pyemd documentation says that emd_samples accepts only 1D arrays, so I think its output is wrong. How can I calculate this distance in this case?

Enthrall answered 19/8, 2019 at 19:8

So if I understand you correctly, you're trying to transport the sampling distribution, i.e. calculate the distance for a setup where all clusters have weight 1. In general, you can treat the calculation of the EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem: your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using scipy.spatial.distance.cdist, and in fact SciPy provides a solver for the linear sum assignment problem as well, in scipy.optimize.linear_sum_assignment (which recently saw huge performance improvements, available from SciPy 1.4 onwards; this could be of interest to you should you run into performance problems, as the 1.3 implementation is a bit slow for 1000x1000 inputs).

In other words, what you want to do boils down to

from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

# Y1 and Y2 are the two sample sets from the question
n = len(Y1)
d = cdist(Y1, Y2)
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)

It is also possible to use scipy.sparse.csgraph.min_weight_full_bipartite_matching as a drop-in replacement for linear_sum_assignment; while it is made for sparse inputs (which yours certainly isn't), it might provide performance improvements in some situations.
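As a sketch of that drop-in use (assuming SciPy >= 1.6, where the function was added; note that the sparse solver treats unstored entries as missing edges, so strictly positive weights are assumed here):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import min_weight_full_bipartite_matching
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
Y1 = rng.standard_normal((100, 2))
Y2 = rng.standard_normal((100, 2)) - 1

# All pairwise distances are positive for these samples, so the
# sparse representation keeps every edge of the bipartite graph.
d = cdist(Y1, Y2)
row, col = min_weight_full_bipartite_matching(csr_matrix(d))
emd = d[row, col].sum() / len(Y1)
print(emd)
```

The returned index arrays can be used exactly like the output of linear_sum_assignment, so the two solvers are interchangeable for dense, positive cost matrices like this one.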

It might be instructive to verify that the result of this calculation matches what you would get from a minimum cost flow solver; one such solver is available in NetworkX, where we can construct the graph by hand:

import networkx as nx

# n and d are as above: n = 1000 points per set, d the pairwise distance matrix
G = nx.DiGraph()

# Represent elements in Y1 by 0, ..., 999, and elements in
# Y2 by 1000, ..., 1999.
for i in range(n):
    G.add_node(i, demand=-1)
    G.add_node(n + i, demand=1)

for i in range(n):
    for j in range(n):
        G.add_edge(i, n + j, capacity=1, weight=d[i, j])

At this point, we can verify that the approach above agrees with the minimum cost flow:

In [16]: d[assignment].sum() == nx.algorithms.min_cost_flow_cost(G)
Out[16]: True

Similarly, it's instructive to see that the result agrees with scipy.stats.wasserstein_distance for 1-dimensional inputs:

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import wasserstein_distance

np.random.seed(0)
n = 100

Y1 = np.random.randn(n)
Y2 = np.random.randn(n) - 2
d = np.abs(Y1 - Y2.reshape((n, 1)))

assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)       # 1.9777950447866477
print(wasserstein_distance(Y1, Y2))  # 1.977795044786648
Metathesis answered 19/8, 2019 at 20:14
I actually really like your problem re-formulation. This can be used for a limited number of samples, but it works. Thanks!! – Enthrall
Great, you're welcome. If the answer is useful, you can mark it as accepted. And for the record, expect the 1000x1000 case to take less than a second with the 1.4 implementation. – Metathesis
Sorry, I thought that I had accepted it. May I ask which version of SciPy you are using? I am working on Google Colaboratory with the latest version, 1.3.1. However, it is still "slow", so I can't go over 1000 samples. – Enthrall
Yes, 1.3.1 is the latest official release; you can pick up a pre-release of 1.4 from github.com/scipy/scipy (or just wait until it's officially available). Alternatively, if you don't mind building a package yourself, several efficient implementations of solvers for the problem can be found. For instance, the one available at gist.github.com/kylemcdonald/3dcce059060dbd50967970905cf54cd9 solves a 1000x1000 case in about 50 ms, where it would take 100 seconds in SciPy 1.3.1. See the comments to the gist for usage instructions. – Metathesis
For a comparison of that one in particular with the implementation in SciPy, see nbviewer.jupyter.org/urls/gist.githubusercontent.com/fuglede/… -- there are also some useful benchmarks in github.com/scipy/scipy/pull/9800 and github.com/scipy/scipy/pull/10296 – Metathesis
And SciPy 1.4 was released in December 2019, so those improvements are now generally available. – Metathesis
"EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem" This doesn't sound right. In the Earth Mover setup, you're allowed to distribute each input point to multiple output points. This is not the case for linear sum assignment. It's very interesting that in the 1D case each input point gets assigned to exactly one output point (empirically), but it's definitely not obvious (to me) this should be the case in general.Snap
@AlexEftimiades: Are you happy with the minimum cost flow formulation? If so, the integrality theorem for min-cost flow problems tells us that since all demands are integral (1), there is a solution with integral flow along each edge (hence 0 or 1), which in turn is exactly an assignment.Metathesis
I have the same question as that raised by @Alex Eftimiades, i.e., one input point may flow to multiple output points. So I compared linear assignment method with Wasserstein distance from github.com/PythonOT/POT package and got the same results. It looks somewhat they are equivalent? So can you expand on how can the integrality theorem be applied here? (the input/output values are not integers at all)Watchdog
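The integrality claim discussed in the comments above can also be checked empirically: with unit demands, an optimal min-cost flow only ever ships 0 or 1 units along any edge, so it is an assignment. A small sketch using NetworkX (with weights scaled and rounded to integers, since NetworkX's solver is only guaranteed exact for integral weights; the weights themselves need not be integers for the theorem, only the demands and capacities):

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 20
Y1 = rng.standard_normal((n, 2))
Y2 = rng.standard_normal((n, 2)) - 1

# Integer-scaled ground distances for an exact solve
d = np.round(1000 * cdist(Y1, Y2)).astype(int)

# Same bipartite construction as in the answer above
G = nx.DiGraph()
for i in range(n):
    G.add_node(i, demand=-1)
    G.add_node(n + i, demand=1)
for i in range(n):
    for j in range(n):
        G.add_edge(i, n + j, capacity=1, weight=d[i, j])

flow = nx.min_cost_flow(G)
units = {v for targets in flow.values() for v in targets.values()}
print(units)  # only 0s and 1s appear: the optimal flow is an assignment
```

Each source node ships exactly one unit to exactly one sink node, which is precisely the matching that linear_sum_assignment finds.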
