So if I understand you correctly, you're trying to transport the sampling distribution, i.e. calculate the distance for a setup where all clusters have weight 1. In general, you can treat the calculation of the EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem: Your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using scipy.spatial.distance.cdist
, and in fact SciPy provides a solver for the linear sum assignment problem as well in scipy.optimize.linear_sum_assignment
(which recently saw huge performance improvements which are available in SciPy 1.4. This could be of interest to you, should you run into performance problems; the 1.3 implementation is a bit slow for 1000x1000 inputs).
In other words, what you want to do boils down to
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
d = cdist(Y1, Y2)
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)
It is also possible to use scipy.sparse.csgraph.min_weight_bipartite_full_matching
as a drop-in replacement for linear_sum_assignment
; while made for sparse inputs (which yours certainly isn't), it might provide performance improvements in some situations.
It might be instructive to verify that the result of this calculation matches what you would get from a minimum cost flow solver; one such solver is available in NetworkX, where we can construct the graph by hand:
import networkx as nx
G = nx.DiGraph()
# Represent elements in Y1 by 0, ..., 999, and elements in
# Y2 by 1000, ..., 1999.
for i in range(n):
G.add_node(i, demand=-1)
G.add_node(n + i, demand=1)
for i in range(n):
for j in range(n):
G.add_edge(i, n + j, capacity=1, weight=d[i, j])
At this point, we can verify that the approach above agrees with the minimum cost flow:
In [16]: d[assignment].sum() == nx.algorithms.min_cost_flow_cost(G)
Out[16]: True
Similarly, it's instructive to see that the result agrees with scipy.stats.wasserstein_distance
for 1-dimensional inputs:
from scipy.stats import wasserstein_distance
np.random.seed(0)
n = 100
Y1 = np.random.randn(n)
Y2 = np.random.randn(n) - 2
d = np.abs(Y1 - Y2.reshape((n, 1)))
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n) # 1.9777950447866477
print(wasserstein_distance(Y1, Y2)) # 1.977795044786648