Is there a way to measure the distance between two distributions in a multidimensional space in python?

Asked 29/6, 2019 at 19:2 Answered 30/6, 2021 at 14:21

I want to measure the distance between two distributions in a multidimensional space.

For example, I would like to make measurements such as Wasserstein distribution or the energy distance in multiple dimensions, not one-dimensional comparisons.

I found a package in 1D, but I still found one in multi-dimensional. How can I get out of the way?

1D energy distance 1D Wasserstein distance

Galvanism answered 29/6, 2019 at 19:2 Comment(0)

You can use geomloss or dcor packages for the more general implementation of the Wasserstein and Energy Distances respectively. The geomloss also provides a wide range of other distances such as hausdorff, energy, gaussian, and laplacian distances. It also uses different backends depending on the volume of the input data, by default, a tensor framework based on pytorch is being used.

dcor uses scipy.spatial.distance.pdist and scipy.spatial.distance.cdist primarily to calculate the eneryg distance.

Here's a few examples of 1D, 2D, and 3D distance calculation:

# create random 3D data for the test
import torch
torch.random.manual_seed(0)
X = torch.rand((3,100))
Y = torch.rand((3,100))

Energy Distance

# energy distance with geomloss 
from geomloss import SamplesLoss
Loss =  SamplesLoss("energy")

# 3D tensors
Loss( X, Y ).item() 
>>> 0.0063512325286865234

# 2D tensors
Loss( X[:,0:2], Y[:,0:2] ).item() 
>>> 0.005196928977966309

# 1D tensors
Loss( X[:,0,np.newaxis], Y[:,0,np.newaxis] ).item() 
>>> 0.004647582769393921

# energy distance with dcor
import dcor

# 3D tensors
dcor.energy_distance(X,Y)/2
>>> 0.006351688367976838

# 2D tensors
dcor.energy_distance(X[:,0:2], Y[:,0:2])/2
>>> 0.005197086538981377

# 1D tensors
dcor.energy_distance(X[:,0,np.newaxis], Y[:,0,np.newaxis])/2 
>>> 0.004647575813531868

As you might have noticed, I divided the energy distance by two. That's due to the fact that the geomloss calculates energy distance divided by two and I wanted to compare the results between the two packages.

Original Energy Distance

Energy Distance divided by two

You can also look at my implementation of energy distance that is compatible with different input dimensions. It is written using Numba that parallelizes the computation and uses available hardware boosts and in principle should be possible to run it on GPU but I haven't tried. Albeit, it performs slower than dcor implementation.

Wasserstein Distance

Calculating the Wasserstein distance is a bit evolved with more parameters. Sinkhorn distance is a regularized version of Wasserstein distance which is used by the package to approximate Wasserstein distance. It could also be seen as an interpolation between Wasserstein and energy distances, more info in this paper. In principle, for small values of blur near to zero, you would expect to get Wasserstein and for larger values, you get energy distance but for some reason (I think due to due some implementation issues and numerical/precision issues) after some large values, you get some negative value for the distance. Anyhow, if you are interested in Wasserstein distance here is an example:

Loss =  SamplesLoss("sinkhorn", blur=0.05,)
Loss( X, Y ).item()
>>> 0.01524302177131176

Loss( X[:,0:2], Y[:,0:2] ).item()
>>> 0.005164701491594315

Loss( X[:,0,np.newaxis], Y[:,0,np.newaxis] ).item()
>>> 0.0018786040600389242

Other than the blur, I recommend looking into other parameters of this method such as p, scaling, and debias. Please note that the implementation of this method is a bit different with scipy.stats.wasserstein_distance, and you may want to look into the definitions from the documentation or code before doing any comparison between the two for the 1D case!

Bailsman answered 30/6, 2021 at 14:21 Comment(0)

I reckon you want to measure the distance between two distributions anyway? Even if your data is multidimensional, you can derive distributions of each array by flattening your arrays flat_array1 = array1.flatten() and flat_array2 = array2.flatten(), measure the distributions of each (my code is for cumulative distribution but you can go Gaussian as well) - I am doing the flattening in my function here:

`def ecdf(data):
    '''compute eCDF of an image'''
    data_flatten = data.flatten()
    sort_data = np.sort(data_flatten)
    values, bins = np.histogram(sort_data, normed=True)
    cum_data = np.cumsum(values)

    return (bins, cum_data)`

and then measure the distances between the two distributions.

Say if you had two 3D arrays and you wanted to measure the similarity (or dissimilarity which is the distance), you may retrieve distributions using the above function and then use entropy, Kullback Liebler or Wasserstein Distance.

Demerol answered 13/8, 2020 at 0:51 Comment(1)

In general, with this approach, part of the geometry of the object could be lost due to flattening and this might not be desired in some applications depending on where and how the distance is being used or interpreted. – Bailsman 30/6, 2021 at 14:32

Energy Distance

Wasserstein Distance

Recommended topics

Hot tags