How can I implement max norm constraints in an MLP in TensorFlow?

Asked 14/6, 2016 at 2:1 Answered 22/9, 2016 at 18:8

How can I implement max norm constraints on the weights in an MLP in TensorFlow, of the kind that Hinton and Dean describe in their work on dark knowledge? That is, does tf.nn.dropout implement the weight constraints by default, or do we need to do it explicitly, as described in the following paper?

https://arxiv.org/pdf/1207.0580.pdf

"If these networks share the same weights for the hidden units that are present. We use the standard, stochastic gradient descent procedure for training the dropout neural networks on mini-batches of training cases, but we modify the penalty term that is normally used to prevent the weights from growing too large. Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division."
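The rule in the quoted passage can be written compactly. This is a sketch of one reading of it, assuming $w_j$ denotes the incoming weight vector of hidden unit $j$ and $c > 0$ the chosen upper bound (both symbols are introduced here for illustration and do not appear in the quote):

```latex
w_j \leftarrow \frac{w_j}{\max\left(1, \dfrac{\lVert w_j \rVert_2}{c}\right)}
```

When $\lVert w_j \rVert_2 \le c$ the divisor is 1 and the weights are unchanged; otherwise the vector is divided down until its L2 norm equals $c$.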

Keras appears to have it:

http://keras.io/constraints/

Lucrece answered 14/6, 2016 at 2:1 Comment(0)

tf.nn.dropout does not impose any norm constraint. I believe what you're looking for is to "process the gradients before applying them" using tf.clip_by_norm.

For example, instead of simply:

# Create an optimizer; minimize() implicitly calls compute_gradients() and apply_gradients()
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

You could:

# Create an optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Compute the gradients for a list of variables.
grads_and_vars = optimizer.compute_gradients(loss, [weights1, weights2, ...])
# grads_and_vars is a list of (gradient, variable) tuples.
# Process the gradient part as needed, for example by capping its norm.
capped_grads_and_vars = [(tf.clip_by_norm(grad, clip_norm=123.0, axes=0), var)
                         for grad, var in grads_and_vars]
# Ask the optimizer to apply the capped gradients; this returns the training op.
train_op = optimizer.apply_gradients(capped_grads_and_vars)

I hope this helps. Final notes about tf.clip_by_norm's axes parameter:

If you're calculating tf.nn.xw_plus_b(x, weights, biases), or equivalently matmul(x, weights) + biases, where the dimensions of x and weights are (batch, in_units) and (in_units, out_units) respectively, then you probably want to set axes=[0] (because in this usage each column holds all incoming weights to a specific unit).
Pay attention to the shape/dimensions of your variables above and whether/how exactly you want to clip_by_norm each of them! E.g. if some of [weights1, weights2, ...] are matrices and some aren't, and you call clip_by_norm() on the grads_and_vars with the same axes value as in the list comprehension above, this doesn't mean the same thing for all the variables! If you're lucky, this will result in a weird error like ValueError: Invalid reduction dimension 1 for input with 1 dimensions; otherwise it's a very sneaky bug.
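To make the axes point concrete, here is a minimal pure-Python sketch of the arithmetic that clipping along axis 0 performs. It is not TensorFlow code: the helper names clip_columns_by_norm and clip_vector_by_norm and the sample values are hypothetical, and nested lists stand in for tensors. It assumes the documented clip_by_norm behaviour of rescaling by clip_norm / max(norm, clip_norm).

```python
import math


def clip_columns_by_norm(matrix, clip_norm):
    """Rescale each column so its L2 norm is at most clip_norm.

    Illustrates clipping a 2-D (in_units, out_units) tensor along axis 0:
    one norm is computed per column, i.e. per output unit.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
             for c in range(n_cols)]
    # Columns already within the bound get a scale of exactly 1.0.
    scales = [clip_norm / max(norm, clip_norm) for norm in norms]
    return [[matrix[r][c] * scales[c] for c in range(n_cols)]
            for r in range(n_rows)]


def clip_vector_by_norm(vector, clip_norm):
    """Rescale a 1-D vector as a whole: axis 0 is its only axis."""
    norm = math.sqrt(sum(v ** 2 for v in vector))
    scale = clip_norm / max(norm, clip_norm)
    return [v * scale for v in vector]


# Hypothetical 2x2 gradient: column 0 has norm 5, column 1 has norm 0.5.
grad = [[3.0, 0.3],
        [4.0, 0.4]]
clipped = clip_columns_by_norm(grad, clip_norm=1.0)

# The same axis on a 1-D bias gradient clips the entire vector at once.
clipped_bias = clip_vector_by_norm([3.0, 4.0], clip_norm=1.0)
```

Only column 0 is scaled down here, while column 1 passes through untouched. For the 1-D case the same axis yields a single norm for the whole vector, which is why reusing one axes value across differently shaped variables can silently change what gets clipped.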

Sorn answered 22/9, 2016 at 18:8 Comment(0)

You can use tf.clip_by_value:

https://www.tensorflow.org/versions/r0.10/api_docs/python/train/gradient_clipping

Gradient clipping is also used to prevent weight explosion in recurrent neural networks.

Greywacke answered 10/8, 2016 at 3:16 Comment(1)

Wouldn't that be tf.clip_by_norm rather than tf.clip_by_value? – Policyholder 18/8, 2016 at 16:49
