Wasserstein loss can be negative?

Asked 19/7, 2019 at 1:58 Answered 23/2, 2021 at 21:3

Solved python machine-learning keras neural-network generative-adversarial-network

I'm currently training a WGAN in Keras with an (approximate) Wasserstein loss, defined as below:

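from keras import backend as K
def wasserstein_loss(y_true, y_pred):
    # y_true is +1 or -1, so this is plus or minus the mean critic score
    return K.mean(y_true * y_pred)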

However, this loss can obviously be negative, which is weird to me.
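To make the sign behaviour concrete, here is a minimal NumPy sketch of the same formula. The `scores` array is made up for illustration, and `np.mean` stands in for `K.mean`; it is not the training code from the question.

```python
import numpy as np

def wasserstein_loss(y_true, y_pred):
    # NumPy stand-in for K.mean(y_true * y_pred)
    return float(np.mean(y_true * y_pred))

# Hypothetical critic scores for a batch of four samples (mean is 2.0)
scores = np.array([0.5, 1.5, 2.0, 4.0])

# Labels of +1 leave the mean score unchanged; labels of -1 flip its sign
loss_pos_labels = wasserstein_loss(np.ones(4), scores)
loss_neg_labels = wasserstein_loss(-np.ones(4), scores)

print(loss_pos_labels, loss_neg_labels)
```

Because the critic output is unbounded and the labels are plus or minus one, nothing in this formula keeps the value above zero.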

I trained the WGAN for 200 epochs and got the critic Wasserstein loss training curve below.

The above loss is calculated by

d_loss_valid = critic.train_on_batch(real, np.ones((batch_size, 1)))
d_loss_fake = critic.train_on_batch(fake, -np.ones((batch_size, 1)))
d_loss, _ = 0.5*np.add(d_loss_valid, d_loss_fake)
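As a rough sketch of what this average represents, the snippet below reproduces the two loss terms with NumPy on made-up critic outputs (`d_real` and `d_fake` are hypothetical, and only the loss component is modelled, not any extra metric returned by `train_on_batch`). Under the label convention used above (+1 for real, -1 for fake), the averaged loss is minus one half of the gap between the mean fake score and the mean real score, so a negative loss corresponds to a positive gap.

```python
import numpy as np

# Hypothetical critic outputs on one batch
d_real = np.array([-3.0, -1.0, -2.0, -2.0])  # mean is -2.0
d_fake = np.array([1.0, 3.0, 2.0, 2.0])      # mean is  2.0

# Same formula as K.mean(y_true * y_pred) with labels +1 (real) and -1 (fake)
d_loss_valid = float(np.mean(np.ones(4) * d_real))   # equals mean(d_real)
d_loss_fake = float(np.mean(-np.ones(4) * d_fake))   # equals -mean(d_fake)
d_loss = 0.5 * (d_loss_valid + d_loss_fake)

# Gap between mean critic scores; minimizing d_loss pushes this gap up
gap = float(np.mean(d_fake) - np.mean(d_real))

print(d_loss, gap)
```

With these numbers the loss is negative while the gap is positive, which is the situation the critic is being trained toward.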

The generated samples look great, so I think I trained the WGAN correctly. However, I still cannot understand why the Wasserstein loss can be negative while the model still works. According to the original WGAN paper, the Wasserstein loss can be used as a performance indicator for a GAN, so how should we interpret it? Am I misunderstanding anything?

Tradescantia answered 19/7, 2019 at 1:58 Comment(1)

You are correct that the Wasserstein Distance as used mathematically is a metric/distance function and hence is non-negative by definition. However implementations of the function are not necessarily rigorous in all ranges. This can lead to some issues. – Swithbert 20/11, 2019 at 10:23

The Wasserstein loss is an estimate of the Earth Mover's distance, which measures the difference between two probability distributions. In TensorFlow it is implemented as d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real), which can obviously be negative if the d_fake distribution moves too far to the other side of the d_real distribution. You can see this on your plot: during training, the real and fake distributions change sides until they converge around zero. So, as a performance measure, you can use it to see how far the generator is from the real data and on which side it currently is.

See the distributions plot:

P.S. It's cross-entropy loss, not Wasserstein. Perhaps this article can help you further, if you haven't read it yet. However, the other question is how the optimizer can minimize a negative loss (toward zero).

Hospitality answered 11/10, 2019 at 17:38 Comment(0)

It looks like I cannot comment on the answer given by Sergeiy Isakov because I do not have enough reputation. I wanted to comment because I think that information is not correct.

In principle, the Wasserstein distance cannot be negative, because a distance metric cannot be negative. The actual expression (dual form) for the Wasserstein distance involves a supremum over all 1-Lipschitz functions (you can look it up on the web). Since it is a supremum, we always take the Lipschitz function that gives the largest value to obtain the Wasserstein distance. However, the quantity we compute in a WGAN is only an estimate, not the true Wasserstein distance. If the critic gets too few inner iterations, it may not have had enough updates to reach a positive value.
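For reference, the dual (Kantorovich-Rubinstein) form mentioned here can be written as follows. The symbols are my own notation, not taken from the question: $P_r$ is the real data distribution, $P_g$ the generator distribution, and $f$ ranges over 1-Lipschitz functions.

```latex
W(P_r, P_g) = \sup_{\|f\|_{L} \le 1}
  \Big( \mathbb{E}_{x \sim P_r}\big[f(x)\big] - \mathbb{E}_{x \sim P_g}\big[f(x)\big] \Big)
```

A WGAN evaluates the bracketed expression for one particular critic $f$ on a mini-batch. That value is at most the supremum when $f$ is 1-Lipschitz, and nothing forces it to be non-negative for an arbitrary critic.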

Thought experiment: Suppose we obtain a Wasserstein estimate that is negative. We can always negate the critic function to make the estimate positive, and the negation of a 1-Lipschitz function is still 1-Lipschitz. That means there exists a Lipschitz function that gives a positive value, which is larger than the value given by the Lipschitz function that produced the negative one. So the Wasserstein distance itself cannot be negative, since by definition it is the supremum over all 1-Lipschitz functions.
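The thought experiment can be checked numerically with a small sketch. Everything here is an illustrative assumption: two 1-D Gaussians stand in for the real and generated data, and the "critic" is the fixed 1-Lipschitz function f(x) = -x rather than a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, scale=1.0, size=10_000)  # stand-in for real data
fake = rng.normal(loc=0.0, scale=1.0, size=10_000)  # stand-in for generated data

def estimate(critic_fn):
    # Dual-form estimate for one fixed critic: E[f(real)] - E[f(fake)]
    return float(np.mean(critic_fn(real)) - np.mean(critic_fn(fake)))

def critic(x):
    # A badly oriented critic; f(x) = -x is 1-Lipschitz
    return -x

def negated_critic(x):
    # Negating a 1-Lipschitz function keeps it 1-Lipschitz
    return -critic(x)

est = estimate(critic)              # negative for this critic
est_neg = estimate(negated_critic)  # same magnitude, positive sign
best = max(est, est_neg)            # a supremum would never pick the negative one

print(est, est_neg, best)
```

The badly oriented critic gives a negative estimate, its negation gives the same magnitude with a positive sign, and a supremum over both would pick the positive one.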

Winker answered 23/2, 2021 at 21:3 Comment(1)

Thank you for this insight. IMO, in the initial stages of the training, since the critic is not yet 1-Lipschitz, the value obtained is not exactly an EMD estimate. I would be worried if the gradient penalty term is close to zero and we still get a negative value. – Lajoie 4/11, 2022 at 5:21
