Tensorflow NaN bug?

I'm using TensorFlow and I modified the tutorial example to take my RGB images.

The algorithm works flawlessly out of the box on the new image set until, suddenly (while still converging, usually at around 92% accuracy), it crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, quite suddenly and for no apparent reason, the error is thrown. Adding

print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())

as debug code to each loop iteration yields the following output:

Step 8600
max W vales: 0.759422 0.295087 0.344725 0.583884
max b vales: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W vales: 0.75947 0.295084 0.344723 0.583893
max b vales: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W vales: 0.759521 0.295101 0.34472 0.5839
max b vales: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38

Since none of my values is very high, the only way a NaN can happen is a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no explanation other than that this comes from TF's internal code.

I'm clueless about what to do with this. Any suggestions? The algorithm was converging nicely; its accuracy on my validation set was steadily climbing and had just reached 92.5% at iteration 8600.

Pedigree answered 14/11, 2015 at 19:1 Comment(0)

Actually, it turned out to be something stupid. I'm posting this in case anyone else runs into a similar error.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv = 0 for that sample. That's normally not a problem, since you're not interested in those, but with cross_entropy written this way it yields 0*log(0) for that particular sample/class. Hence the NaN.

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved all my problems.

Pedigree answered 14/11, 2015 at 20:49 Comment(6)
Glad you solved it! As an additional note, you might find convolutional.py a better starting point if you're handling real data. It's parameterized with NUM_CHANNELS at the top of the file, and if you switch it from 1 to 3, you should be good to go with RGB data. I've used it out of the box for classifying some larger RGB datasets that were downsized to "mnist size" (28x28) and it works pretty decently. The key is using tf.nn.softmax_cross_entropy_with_logits. – Evangelineevangelism
@Evangelineevangelism here's the updated link to convolutional.py, as it is no longer in the tensorflow master branch. – Coreligionist
Note: this solution introduces bias. I've posted an answer below which avoids this problem. – Addax
Why not just tf.nn.softmax_cross_entropy_with_logits(labels=y_,logits=y) (usually no need to manually clip logits), instead of your y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0))? This was mentioned in the beginner tutorial. – Scriabin
@YiboYang I think you should consider posting this comment as an answer. Pointing out that this was covered in the beginner tutorial is a valuable contribution here, since many people with this problem may have seen the hand-written formula in the tutorial and missed the pointer to tf.nn.softmax_cross_entropy_with_logits (like I did). It is helpful to be shown that the tutorial can still be trusted. – Bergman
@YiboYang ...of course, you should probably clarify how to actually use the function (passing in the scores before softmax). – Bergman

A bias-free alternative.

Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity itself, not the region near it.

Specific Answer

def cross_entropy(x, y, axis=-1):
  safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
  return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
  return cross_entropy(x, x, axis)

But did it work?

x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([1.30258512,  0.60943794, 0., -0.64332503], dtype=float32)  Yay! No NaN.

(Note: deleted dup cross-post.)

General Recipe

Use an inner tf.where to ensure the function has no asymptote; that is, alter the input to the inf-generating function so that no inf can be created. Then use a second tf.where to always select the valid code path; that is, implement the mathematical condition as you would "normally", i.e., as in the "naive" implementation.

In Python code, the recipe is:

Instead of this:

tf.where(x_ok, f(x), safe_f(x))

Do this:

safe_x = tf.where(x_ok, x, safe_value)  # safe_value: any input at which f is finite, e.g. tf.ones_like(x)
tf.where(x_ok, f(safe_x), safe_f(x))

Example

Suppose you wish to compute:

f(x) = { 1/x, x!=0
       { 0,   x=0

A naive implementation results in NaNs in the gradient, i.e.,

def f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  return tf.where(x_ok, f(x), safe_f(x))

Does it work?

x = tf.constant([-1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1.,  nan,  -1.], dtype=float32)
#  ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.

The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that f is never evaluated at the problematic input, so its result is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:

def safe_f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  safe_x = tf.where(x_ok, x, tf.ones_like(x))
  return tf.where(x_ok, f(safe_x), safe_f(x))

But did it work?

x = tf.constant([-1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1.,  0., -1.], dtype=float32)
# ...yay! double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).
Addax answered 27/2, 2017 at 23:8 Comment(4)
I was a bit confused about the behavior of your naive implementation and using tf.where twice to solve it, but it's easy to understand if you plot the computation graph of the gradient yourself. At some point there is grad(1./x, x) * 0.0, which results in nan. Btw, tf.cond does not have this issue, but it is not really an alternative in most cases. – Amalle
Hi Albert, thanks for pointing this out. I've corrected a few bugs in the general procedure and improved the example. – Addax
This! Great answer! It should be part of an advanced TensorFlow tutorial/docs or similar. – Orthodox
Note: I've also documented this answer here: github.com/tensorflow/probability/blob/master/discussion/… – Addax

Actually, clipping is not a good idea, as it stops the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
Grati answered 30/7, 2016 at 11:4 Comment(3)
This is exactly what I'm doing in my network, but I'm still getting NaNs when computing what amounts to the following: tf.log(1e-10 + 1 - 1). If I print out the data and compute the same value in Excel I get the correct value of -23. – Neville
@fwc, I encountered the same issue. Increasing it to something like tf.log(1e-7+...) solved the problem. – Including
@fwc I was able to reproduce this issue and filed a bug report here: github.com/tensorflow/tensorflow/issues/25728 – Bacolod

If y_conv is the result of a softmax, say, y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:

y = tf.nn.log_softmax(x)
cross_entropy = -tf.reduce_sum(y_*y)
Benempt answered 20/7, 2016 at 19:52 Comment(0)

You are trying to calculate the cross-entropy using the standard formula. Not only is the value undefined when x=0, it is also numerically unstable.

It is better to use tf.nn.softmax_cross_entropy_with_logits, or, if you really want to use the hand-crafted formula, to tf.clip_by_value the zeros to a very small number inside the log.
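
A minimal sketch of both options, assuming logits holds the raw pre-softmax scores of the last layer and y_ the one-hot labels (hypothetical names):

# Preferred: the fused op computes softmax and log together in a numerically stable way
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))

# Hand-crafted alternative: clip the probabilities away from 0 before taking the log
y_conv = tf.nn.softmax(logits)
cross_entropy_clipped = -tf.reduce_sum(
    y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)))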

Gamma answered 29/4, 2017 at 5:32 Comment(0)

Sometimes you use the tf.sqrt() function without adding a small constant such as 1e-10 inside it, inducing this NaN problem.
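
A minimal sketch of the usual fix, assuming x is a non-negative tensor (a hypothetical name); the constant keeps the gradient of the square root finite at x = 0:

eps = 1e-10
y = tf.sqrt(x + eps)  # gradient is 1/(2*sqrt(x + eps)), finite even when x == 0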

Clomp answered 27/10, 2018 at 2:44 Comment(2)
The derivative of sqrt at 0 is infinite, which likely causes the instability. – Ayo
It can also be "hidden": I was using tf.math.reduce_euclidean_norm to compute the true norm (sic) instead of the squared one usually used for training... – Skees

I used an LSTM for long sequences and got NaN gradients. None of these answers helped me, but I came up with three solutions of my own. I hope they will be useful to other people who came here from a Google search.

  1. Gradient clipping didn't help me because the gradients turned NaN within a single batch update. In this case, you can replace the NaNs with zeros using lines like these:

    opt = tf.train.AdamOptimizer(args.lr)
    grads = opt.compute_gradients(loss)
    grads2 = [(tf.where(tf.is_nan(grad), tf.zeros(grad.shape), grad), var) for grad, var in grads]
    opt_op = opt.apply_gradients(grads2)
    

    If you want to track whether NaNs appeared, you can use this code:

    was_nan = tf.reduce_any(tf.convert_to_tensor([tf.reduce_any(tf.is_nan(g)) for g in grads]))
    
  2. Replace LSTMCell with LayerNormBasicLSTMCell - an LSTM cell with layer norm - something similar to batch norm between timesteps.

  3. If you use regular recurrent state dropout you can replace it with "Recurrent Dropout without Memory Loss". Code:

    LayerNormBasicLSTMCell(neurons, dropout_keep_prob=0.8)
    

    Note that you can also turn on the dropout feature alone without layer normalization:

    LayerNormBasicLSTMCell(neurons, layer_norm=False, dropout_keep_prob=0.8)
    
Glorianna answered 6/12, 2017 at 19:33 Comment(0)

Besides all the great answers above, I will add mine. It's a less common scenario to run into, but it does cause NaN: division by zero.

In my network for an NLP task, there is a layer that does average pooling. Each data point is a sequence of tokens. My layer does some token embedding and then calculates the average of the embedded vectors.

The average calculation is coded as

tf.reduce_sum(embedded) / tf.reduce_sum(tf.cast(tf.not_equal(input, pad), tf.float32))

Here pad is some dummy token I use in batch processing.

Now, if some data point contains an empty token list (for whatever reason), its length (the denominator in the code snippet above) would be 0. This causes a divide-by-zero issue, and the NaN will remain in all the following layers/optimization steps.

In case anyone runs into this issue, I used tf.where to smooth those lengths:

sum_embedding = tf.reduce_sum(embedded, 1)
embedding_length = tf.reduce_sum(tf.cast(tf.not_equal(input, pad), dtype=tf.float32), axis=1, keep_dims=True)
embedding_length_smoothed = tf.where(tf.greater(embedding_length, 0.0), embedding_length, tf.ones(tf.shape(embedding_length)))
avg_embedding = sum_embedding / embedding_length_smoothed

Essentially, this treats all data points with a 0-length token list as having length 1, and avoids the NaN issue.

Unidirectional answered 2/7, 2018 at 14:49 Comment(0)

Here is how the binary (sigmoid) and categorical (softmax) cross-entropy losses are implemented in TensorFlow 1.1 (see the source of tf.nn.sigmoid_cross_entropy_with_logits and tf.nn.softmax_cross_entropy_with_logits).

As one can see, in the binary case some special cases are handled to achieve numerical stability:

# The logistic loss formula from above is
#   x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
#   -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
#   max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(relu_logits - logits * labels,
                    math_ops.log1p(math_ops.exp(neg_abs_logits)),
                    name=name)
Cornflakes answered 16/5, 2017 at 9:37 Comment(0)

Tensorflow 2.0 Compatible Answer: Code to migrate @user1111929's answer from Tensorflow 1.x to Tensorflow 2.x is shown below:

Tensorflow 1.x :

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

Tensorflow 2.x:

cross_entropy = -tf.compat.v2.reduce_sum(y_*tf.math.log(tf.compat.v2.clip_by_value(y_conv,1e-10,1.0)))

or

cross_entropy = -tf.compat.v2.math.reduce_sum(y_*tf.math.log(tf.compat.v1.clip_by_value(y_conv,1e-10,1.0)))
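
For completeness, a more idiomatic TensorFlow 2.x sketch would skip the hand-written formula and pass the raw scores to a built-in loss; this assumes logits are the pre-softmax outputs and y_ the one-hot labels (hypothetical names):

# Softmax and log are fused internally, which sidesteps the 0*log(0) problem
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
cross_entropy = loss_fn(y_, logits)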

Bavaria answered 30/1, 2020 at 5:19 Comment(0)

I was getting NaNs sometimes and not at other times while working on a standard feed-forward network. I had previously used similar TensorFlow code and it worked fine.

It turned out that I had imported the variable names by accident. So, as soon as the first row (the variable names) was selected in a batch, the NaN losses started. Maybe keep an eye out for that?

Monseigneur answered 26/2, 2018 at 18:39 Comment(0)

I will add here one of my previous problems with NaNs. I was using the sigmoid function as the activation of the last layer of my network. However, the sigmoid activation function is computed using the exponential function, and I had some really big numbers entering the sigmoid.

It resulted in infinite gradients, and some NaNs started to appear.

Yawn answered 2/7, 2019 at 8:27 Comment(0)

I've been using the TensorFlow Estimator, which I believe accounts for division by zero and other numerical stability issues, and occasionally I get this error (ERROR:tensorflow:Model diverged with loss = NaN during training). Most of the time when I get it, it's because my inputs include NaNs. So: be sure that your input dataframes (or whatever you use) don't have NaN values hidden somewhere in them.
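
A quick sanity check, assuming the input is a pandas DataFrame named df (a hypothetical name):

# Fail fast if any NaNs slipped into the input data
assert not df.isnull().values.any(), "input contains NaN values"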

Diathermy answered 12/7, 2019 at 2:6 Comment(0)

Another option is to use tf.math.xlogy function. The function description says "Returns 0 if x == 0, and x * log(y) otherwise, elementwise." You can find the documentation here: https://www.tensorflow.org/api_docs/python/tf/math/xlogy
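
A minimal sketch of how it could replace the hand-written term from the question (with y_ and y_conv as defined there):

# xlogy(y_, y_conv) returns 0 wherever y_ == 0, so 0 * log(0) never produces a NaN
cross_entropy = -tf.reduce_sum(tf.math.xlogy(y_, y_conv))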

Hurdygurdy answered 31/7, 2020 at 2:50 Comment(0)

If y_conv is the output of a sigmoid activation function, there is a better way to calculate tf.log(y_conv).

Let y_conv = sigmoid(x). Then,

   log(y_conv) = log(sigmoid(x))
               = log(1 / (1 + exp(-x)))
               = log(1 / (1 + exp(-x))) - x + x
               = -log(1 + exp(-x)) - log(exp(x)) + x
               = -log(1 + exp(x)) + x
               = x - softplus(x)
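
In code, a minimal sketch of this identity (assuming x holds the pre-sigmoid logits, as above):

# Stable replacement for tf.log(tf.sigmoid(x)); equivalent to tf.math.log_sigmoid(x)
log_y_conv = x - tf.nn.softplus(x)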
Gabfest answered 9/11, 2020 at 21:38 Comment(0)
