If we can clip gradient in WGAN, why bother with WGAN-GP?

I am working on WGAN and would like to implement WGAN-GP.

In its original paper, WGAN-GP enforces the 1-Lipschitz constraint on the critic with a gradient penalty. But packages out there like Keras can clip the gradient norm at 1 (which by definition is equivalent to the 1-Lipschitz constraint), so why do we bother to penalize the gradient? Why don't we just clip it?
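For reference, this is the kind of gradient clipping I have in mind; a minimal sketch using Keras's built-in clipnorm argument (the critic model and the rest of the training loop are omitted):

```python
import tensorflow as tf

# Clip each gradient tensor so its L2 norm is at most 1 before the
# optimizer applies the update.
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
```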

Trill answered 6/11, 2019 at 5:51 Comment(0)

The reason is that clipping is a hard constraint in the mathematical sense, not in the sense of implementation complexity. If you check the original WGAN paper, you'll notice that the clipping procedure takes the model's weights and some hyperparameter c, which controls the range for clipping.

If c is small, the weights are clipped into a tiny range of values. The question is how to determine an appropriate value of c: it depends on your model, the dataset in question, the training procedure, and so on. So why not try a soft penalty instead of hard clipping? That's why the WGAN-GP paper adds an extra term to the loss function that pushes the norm of the critic's gradient as close to 1 as possible, avoiding a hard collapse to predefined values.
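To make the difference concrete, here is a minimal TensorFlow/Keras-style sketch of both ideas; `critic`, `real_images`, `fake_images`, `wasserstein_loss`, `lambda_gp`, and `batch_size` are assumed to be defined elsewhere in your training loop:

```python
import tensorflow as tf

# --- WGAN: hard weight clipping after every critic update ---
c = 0.01  # clipping range; a hyperparameter you have to guess
for w in critic.trainable_weights:
    w.assign(tf.clip_by_value(w, -c, c))

# --- WGAN-GP: soft gradient penalty added to the critic loss ---
# Sample points on straight lines between real and fake images and push
# the norm of the critic's gradient at those points towards 1.
eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
x_hat = eps * real_images + (1.0 - eps) * fake_images
with tf.GradientTape() as tape:
    tape.watch(x_hat)
    critic_out = critic(x_hat, training=True)
grads = tape.gradient(critic_out, x_hat)  # d critic / d x_hat
grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
gradient_penalty = tf.reduce_mean(tf.square(grad_norm - 1.0))
critic_loss = wasserstein_loss + lambda_gp * gradient_penalty  # lambda_gp = 10 in the paper
```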

Mori answered 6/11, 2019 at 6:10 Comment(5)
If you're satisfied, please mark as resolved, since there is nothing to add. :) – Mori
I just realized that I am talking about clipping the gradients, not clipping the weights. Does your explanation apply to gradients as well? – Trill
@Trill Not exactly. WGAN introduces weight clipping as a way to regularize the model's weights. WGAN-GP replaces weight clipping with a gradient penalty. Gradient clipping is a different beast and can't be used for this task; it does not impose the proper mathematical constraint on the model's weights. – Mori
@Trill The reason is, I guess, that the current gradient, produced by the current error and data, even when clipped to some c, does not guarantee that the final model satisfies Lipschitz continuity. – Mori
I see. This answers my question perfectly. – Trill

The answer by CaptainTrunky is correct, but I also want to point out one really important aspect.

Citing the original WGAN-GP paper:

Implementing a k-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in [Corollary 1], the optimal WGAN critic has unit gradient norm almost everywhere under P_r and P_g; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm k end up learning extremely simple functions.

So as you can see, weight clipping may lead to undesired behaviour (it depends on the data you want to generate; the authors of the paper state that it doesn't always behave like that). When you try to train a WGAN to generate more complex data, there is a high chance the training will fail.

Fryd answered 18/3, 2020 at 8:57 Comment(0)
