Tensorflow issue with softmax
I have a Tensorflow multiclass classifier that is generating nan or inf while computing probabilities using tf.nn.softmax. See the following snippet (logits is of shape batch_size x 6, since I have 6 classes and the output is one-hot encoded). batch_size is 1024.

logits = tf.debugging.check_numerics(logits, message='bad logits', name=None)
probabilities = tf.nn.softmax(logits=logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

The classifier fails on the last statement, as it finds nan or inf in probabilities. The logits themselves are clean; otherwise the first statement would have failed.

From what I read about tf.nn.softmax, it can handle very large and very small values in logits. I have verified this in interactive mode.

>>> with tf.Session() as s:
...   a = tf.constant([[1000, 10], [-100, -200], [3, 4.0]])
...   sm = tf.nn.softmax(logits=a, name='Softmax')
...   print(a.eval())
...   print(sm.eval())
...
[[1000.   10.]
 [-100. -200.]
 [   3.    4.]]
[[1.         0.        ]
 [1.         0.        ]
 [0.26894143 0.7310586 ]]

I then tried clipping the values in logits and the whole thing now works. See the modified snippet below.

logits = tf.debugging.check_numerics(logits, message='logits', name=None)
safe_logits = tf.clip_by_value(logits, -15.0, 15.0)
probabilities = tf.nn.softmax(logits=safe_logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

In the second statement, I clip the values in logits to the range [-15, 15], and that somehow prevents nan/inf in the softmax computation. So I was able to fix the issue at hand.

However, I still don't understand why this clipping works. (I should mention that clipping between -20 and 20 does not work; the model still fails with nan or inf in probabilities.)
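
As a sanity check on the 15-vs-20 puzzle: in float32, exp() only overflows once its argument exceeds roughly 88, so neither clip bound should matter to the softmax arithmetic itself. A quick NumPy check:

```python
import numpy as np

# float32 can represent values up to ~3.4e38, so exp() only overflows
# once its argument exceeds roughly 88. Clip bounds of 15 vs 20 are
# both far inside the safe range for the exponentials in softmax.
print(np.exp(np.float32(20.0)))  # ~4.85e8, finite
print(np.exp(np.float32(88.0)))  # ~1.65e38, still finite
print(np.exp(np.float32(89.0)))  # inf: past the float32 ceiling
```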

Could someone help me understand why this is the case?

I am using tensorflow 1.15.0, running on a 64-bit instance.

Lazaro answered 30/8, 2021 at 18:35 Comment(4)
How do you compute logits? – Thomasson
The logits are the outputs of the previous layer (right before the head). – Lazaro
I tried your code with tensorflow 2.0, and I did not get the error you describe. – Lutherlutheran
It is hard to reproduce this error on a small sample. The job runs for 100K steps before this happens. – Lazaro
The first place to look is the values themselves, which you have already checked. The second place to look is the gradients: even if a value appears reasonable, a very steep gradient will eventually blow up both the gradients and the values during backprop.

For example, if the logits are generated by something like log(x), an x of 0.001 will produce -6.9, which looks pretty benign. But the gradient of log(x) is 1/x, which at x = 0.001 is 1000! That will quickly explode the gradients and values during backprop / forward prop.

# Pretend this is the source value that is fed to a function that generates the logit. 
>>> x = tf.Variable(0.001)

# Let's operate on the source value to generate the logit. 
>>> with tf.GradientTape() as tape:
...   y = tf.math.log(x)
... 

# The logit looks okay... -6.9. 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-6.9077554>

# But the gradient is exploding. 
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=999.99994>
>>> 

Clipping the logit would appear to work by feeding smaller values to softmax, but that's probably not why it helps. (In fact, softmax can handle a logit with value tf.float32.max no problem, so the value of the logit is very unlikely to be the issue.) What may really be happening is that when you clip to 15, you are also setting the gradient to zero wherever the logit would otherwise be 20 with an explosive gradient. So clipping the value also clips the gradient.

# This is same source variable as above. 
>>> x = tf.Variable(0.001)

# Now let's operate with clipping. 
>>> with tf.GradientTape() as tape:
...   y = tf.clip_by_value(tf.math.log(x), -1., 1.)
... 

# The clipped logit still looks okay... 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-1.0>

# What may be more important is that the clipping has also zeroed out the gradient
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
Linnette answered 2/9, 2021 at 19:23 Comment(7)
Your explanation seems plausible, but it still does not explain why the logits are okay (no nans/infs) while the softmax computed from them shows nan/inf. Is there a possibility that with graph computation, all operations, including the gradients of the previous step, are computed at the time of computing softmax? (That still wouldn't explain why the logits are okay, though.) – Lazaro
Does your nan/inf occur during model.fit or during model.predict / model.evaluate / model.__call__? – Linnette
I think it happens during fit. This piece of code is common to both fit and predict, so it is hard to tell exactly where it is failing. But it fails during training, at around 100k steps. Every few minutes, there is a callback to evaluate as well. So I am not sure if it is fit or predict/evaluate. – Lazaro
Yup, that would support the idea I hypothesized. Although you could sometimes print the logits and see them as okay, during backprop, at some point the gradient gets large, like 1000, then explodes into inf/nan. It may not be the logit triggering this; it could be upstream. When this happens to me, I usually try reducing the LR, and sometimes clip the gradients, both of which are cheap ways to solve the problem. Adding batch norms can help. Then I add a checkpoint callback so I can get the model back to a state that's about to explode. – Linnette
And if you really want to go deep into it, you can build a custom training loop using tf.GradientTape and record the gradients directly. keras.io/guides/writing_a_training_loop_from_scratch/… – Linnette
Oh yeah, the first thing I'd do is stick tf.debugging.check_numerics everywhere. This will help you isolate the op that is actually triggering the nan. – Linnette
Thanks. I'll try some of the suggestions you mentioned. – Lazaro
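
To make the gradient-clipping suggestion from the comments concrete, here is a minimal sketch of clipping the gradients rather than the logits, using a TF 2.x-style custom loop like the answer's examples. The model, optimizer, and data here are illustrative placeholders, not the asker's actual setup:

```python
import tensorflow as tf

# Illustrative stand-ins for the real model and data.
model = tf.keras.Sequential([tf.keras.layers.Dense(6)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

x = tf.random.normal([8, 4])
y = tf.one_hot(tf.random.uniform([8], maxval=6, dtype=tf.int32), depth=6)

with tf.GradientTape() as tape:
    logits = model(x)
    loss = loss_fn(y, logits)

grads = tape.gradient(loss, model.trainable_variables)
# Rescale the whole gradient so its global norm is at most 1.0.
# This caps explosive steps without zeroing gradients the way
# clip_by_value on the logits does.
grads, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

Unlike tf.clip_by_value on the logits, global-norm clipping only shrinks the update's magnitude; it never kills the gradient signal outright.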