How does one debug NaN values in TensorFlow?

I was running TensorFlow and happened to have something yielding a NaN. I'd like to know what it is, but I do not know how to find out. The main issue is that in a "normal" procedural program I would just write a print statement just before the operation is executed. The issue with TensorFlow is that I cannot do that, because I first declare (or define) the graph, so adding print statements to the graph definition does not help. Are there any rules, advice, or heuristics to track down what might be causing the NaN?


In this case I know more precisely what line to look at because I have the following:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it's a pair-wise Euclidean distance
Z = tf.sqrt(Delta_tilde)
Z = Transform(Z) # potentially some transform; currently it just returns Z (the identity) for debugging
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

When this line is present, it returns NaN, as reported by my summary writers. Why is this? Is there a way to at least explore what value Z has after it has been square-rooted?


For the specific example I posted, I tried tf.Print(0, [Z]), but with no success: it printed nothing. As in:

Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it's a pair-wise Euclidean distance
Z = tf.sqrt(Delta_tilde)
tf.Print(0, [Z]) # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform; currently it just returns Z (the identity) for debugging
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)

I actually don't understand what tf.Print is supposed to do. Why does it need two arguments? If I want to print one tensor, why would I need to pass two? It seems bizarre to me.


I was also looking at the function tf.add_check_numerics_ops(), but it doesn't say how to use it (plus the docs don't seem to be super helpful). Does anyone know how to use it?


Since I've had comments suggesting the data might be bad: I am using standard MNIST. However, I am computing a quantity that is positive (the pair-wise Euclidean distance) and then square-rooting it. Thus, I don't see how the data specifically would be the issue.

Rosmunda answered 7/8, 2016 at 2:47 Comment(0)

There are a couple of reasons why you can get a NaN result; often it is because of too high a learning rate, but plenty of other reasons are possible, for example corrupt data in your input queue or a log-of-0 calculation.
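
For the log-of-0 case in particular, the usual fix is to clip the value before taking the log. A minimal sketch, assuming a hand-rolled cross-entropy with hypothetical y_true/y_pred placeholders:

import tensorflow as tf

# Hypothetical one-hot labels and softmax outputs, just for illustration
y_true = tf.placeholder(tf.float32, [None, 10])
y_pred = tf.placeholder(tf.float32, [None, 10])

# Clip the prediction away from exact 0 so tf.log never returns -inf,
# which would otherwise turn into NaN further down the graph.
eps = 1e-8
cross_entropy = -tf.reduce_sum(y_true * tf.log(tf.clip_by_value(y_pred, eps, 1.0)), axis=1)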

Anyhow, the debugging you describe cannot be done with a simple print (that would only print the tensor information inside the graph, not any actual values).

However, if you use tf.Print as an op when building the graph, then when the graph gets executed you will get the actual values printed (and it IS a good exercise to watch these values to debug and understand the behavior of your net).

However, you are not using the print op quite correctly. It is an op, so you need to pass it a tensor and use the result tensor later on in the executing graph; otherwise the op is not going to be executed and no printing occurs. Try this:

Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z, [Z], message="my Z-values:") # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform; currently the identity, for debugging
Z = tf.pow(Z, 2.0)
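
The print only fires when the graph is actually executed and the returned tensor is on the path being evaluated. A minimal self-contained sketch, with made-up constant values just for illustration:

import tensorflow as tf

Delta_tilde = tf.constant([[4.0, 9.0], [16.0, 25.0]])  # made-up positive values
Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z, [Z], message="my Z-values: ")  # prints to stderr when Z is evaluated
A = tf.exp(tf.pow(Z, 2.0))

with tf.Session() as sess:
    sess.run(A)  # the Print op runs because A depends on the returned Z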
Noyade answered 9/8, 2016 at 7:49 Comment(3)
Why does one have to pass the first Z if the second [Z] is the data? In essence, the API for tf.Print is confusing. Why do we need two input arguments to print a single thing?Rosmunda
The list of tensors [Z] is printed when the first tensor Z is evaluated. Sometimes one may want to print out different things.Thromboembolism
Here is a small snippet that I find useful for some tensor x: DEBUGGING = False; x = x if not DEBUGGING else tf.Print(x, [x], 'Value of x: ')Bolan

I have found it is often much tougher to pinpoint where the NaNs and Infs occur than to fix the bug itself. As a complement to @scai's answer, I'd like to add some points here:

The debug module, which you can import with:

from tensorflow.python import debug as tf_debug

is much better than any print or assert.

You can add the debugging functionality just by wrapping your session:

sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

You will then be dropped into a command-line interface, where you enter run -f has_inf_or_nan and lt -f has_inf_or_nan to find where the NaNs or Infs are. The first one listed is the first place where the catastrophe occurs, and from the variable name you can trace back to the origin in your code.
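
Putting it together, a minimal self-contained sketch, using a toy op that deliberately produces NaN values:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

x = tf.constant([1.0, 0.0])
z = tf.log(x) / tf.log(x)  # deliberately produces NaN values

sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

# sess.run() now drops into the tfdbg CLI; at the prompt, type
#   run -f has_inf_or_nan
# to run until a tensor containing an inf or NaN is produced, then
#   lt -f has_inf_or_nan
# to list the offending tensors.
sess.run(z)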

Reference: https://developers.googleblog.com/2017/02/debug-tensorflow-models-with-tfdbg.html

Bigham answered 11/2, 2018 at 7:25 Comment(2)
Did you also experience a severe slowdown of your program when debugging with this tf_debug add-on? Besides, I cannot run tf_debug mode from a terminal command; I can only run this debug setting in PyCharm's debug mode.Chapa
Besides, I needed to add the ui_type="readline" parameter to LocalCLIDebugWrapperSession to make it work: sess = tf_debug.LocalCLIDebugWrapperSession(sess, ui_type="readline") ref: #52748155Chapa

As of version 0.12, TensorFlow ships with a built-in debugger called tfdbg. It streamlines the workflow of debugging this type of bad-numerical-value issue (such as inf and NaN). The documentation is at: https://www.tensorflow.org/programmers_guide/debugger

Erickaericksen answered 16/12, 2016 at 20:21 Comment(0)

It looks like you can call it after you have finished building the graph.

check = tf.add_check_numerics_ops()

I think this will add the check for all floating-point operations. Then, in the session's run function, you can add the check operation.

sess.run([check, ...])
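
A minimal self-contained sketch of that idea, using a toy graph that produces a NaN on purpose:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None])
z = tf.sqrt(x)  # NaN whenever x is negative

# Adds a CheckNumerics assertion after every floating-point op built so far
check = tf.add_check_numerics_ops()

with tf.Session() as sess:
    # Raises InvalidArgumentError naming the offending op (Sqrt here),
    # instead of silently propagating the NaN through the graph.
    sess.run([check, z], feed_dict={x: [4.0, -1.0]})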

Renfro answered 7/8, 2016 at 11:8 Comment(1)
FYI this misses some ops when optimizers are used -- github.com/tensorflow/tensorflow/issues/2288Fortunna

First of all, you need to check your input data properly. In most cases this is the reason, but not always, of course.

I usually use TensorBoard to see what's happening during training, so you can see the values at each step with:

Z = tf.pow(Z, 2.0)
summary_z = tf.scalar_summary('z', Z)
# etc...
summary_merge = tf.merge_all_summaries()
summary_writer = tf.train.SummaryWriter('/tmp/logdir', sess.graph)  # hypothetical log directory

# on each desired step save:
summary_str = sess.run(summary_merge)
summary_writer.add_summary(summary_str, i)

Also, you can simply eval and print the current value:

 print(sess.run(Z))
Kiwanis answered 9/8, 2016 at 8:22 Comment(3)
The issue is that it's getting NaN values, so the summary writer actually exits my script and I'm unable to see it. Are you suggesting that I instead write the value before the op that might be causing the NaN (probably before the sqrt)? Also, this is part of a network, so I call sess.run on some train op. I can't just sess.run Z, unfortunately (or I don't know how to).Rosmunda
You can run several ops at once with op1_answer, op2_answer, opN_answer = sess.run([op1, op2, opN], feed_dict = {etc..})Kiwanis
Thanks! My input data has empty rows... Your answer solved my issue.Joaquinajoash

For TensorFlow 2, inject some x = tf.debugging.check_numerics(x, 'x is nan') into your code. It will throw an InvalidArgument error if x has any values that are not a number (NaN) or infinity (Inf).
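
A minimal sketch of what that looks like in eager mode, with a made-up tensor:

import tensorflow as tf

x = tf.constant([1.0, float("nan"), 3.0])  # made-up tensor containing a NaN
# Raises an InvalidArgumentError because x contains a NaN; with finite
# values it returns x unchanged, which is why the result is assigned back.
x = tf.debugging.check_numerics(x, "x is nan")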

Oh, and for the next person finding this when hunting a TF2 NaN issue: my case turned out to be an exploding gradient. The gradient itself reached 1e+20, which was not quite NaN yet, but adding that to the variable then turned out too big. The diagnosis I did was:

gradients = tape.gradient(loss, training_variables)
for g,v in zip(gradients, training_variables):
  tf.print(v.name, tf.reduce_max(g))
optimizer.apply_gradients(zip(gradients, training_variables))

which revealed the overly large numbers. Running the exact same network on the CPU worked fine, but it failed on the GTX 1080 Ti in my workstation, making a CUDA numerical-stability issue the likely root cause. But since it only occurred sometimes, I duct-taped the whole thing by going with:

gradients = tape.gradient(loss, training_variables)
gradients = [tf.clip_by_norm(g, 10.0) for g in gradients]
optimizer.apply_gradients(zip(gradients, training_variables))

which simply clips exploding gradients to a sane value. For a network where gradients are always high that wouldn't help, but since the magnitudes were high only sporadically, this fixed the problem, and the network now trains nicely on the GPU as well.

Selfpollination answered 1/12, 2019 at 9:55 Comment(1)
Does check_numerics() work during training? The example in the docs wraps it in a try/except. Does this work in graph mode? Also, why are you assigning x = check_numerics(x)?Watercraft

NaNs occurring in the forward pass are one thing, and those occurring in the backward pass are another.

Step 0: data

Make sure that there are no extreme inputs, such as NaN inputs or negative labels, in the prepared dataset, using NumPy tools, for instance: assert not np.any(np.isnan(x)).
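
A minimal sketch of such a sanity check, with hypothetical file names for the prepared arrays:

import numpy as np

x_train = np.load("x_train.npy")  # hypothetical paths; adapt to your dataset
y_train = np.load("y_train.npy")

assert not np.any(np.isnan(x_train)), "features contain NaN"
assert not np.any(np.isinf(x_train)), "features contain Inf"
assert y_train.min() >= 0, "negative labels found"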

Step 1: the forward

Switch to a CPU environment to get a more detailed traceback, and test the forward pass only, by inserting loss = tf.stop_gradient(loss) before calculating the gradients, to see whether you can run several batches with no errors. If an error occurs, there are several types of potential bugs and methods:

  1. 0 in the log for the cross-entropy loss functions (please refer to this answer)
  2. the 0/0 problem
  3. the out-of-class label problem, as reported here.
  4. try tensor = tf.check_numerics(tensor, 'tensor') in some suspicious places (see the sketch after this list).
  5. try tf_debug as described in this answer.
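
A minimal sketch of the forward-only test with check_numerics sprinkled in, using a toy MNIST-shaped model with made-up names (TF1-style):

import numpy as np
import tensorflow as tf

# Toy MNIST-shaped forward pass, just to illustrate the pattern
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])

logits = tf.layers.dense(x, 10)
logits = tf.check_numerics(logits, "logits")  # fail fast on the first inf/NaN
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
loss = tf.stop_gradient(loss)  # forward pass only, no gradients yet

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.rand(32, 784).astype(np.float32),
            y: np.random.randint(0, 10, size=32)}
    print(sess.run(loss, feed_dict=feed))  # run a few real batches like this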

Step 2: the backward

If everything goes well, remove the loss = tf.stop_gradient(loss).

  1. try a very small learning rate
  2. replace complex blocks of code with simple computations of the same input and output shapes, such as a fully connected layer, to zoom in on where the bug lies. You may encounter backward bugs like this.

As an aside, it's always helpful to make sure that the shape of every tensor is as desired. You can try to feed fixed-size batches (drop the remainder) and reshape the feature tensors (where the graph receives data from the Dataset) to the shape you expect (otherwise the first dimension would sometimes be None), and then print the shape of the tensor in question in the graph with fixed numbers.
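
A minimal sketch of that idea with tf.data, using made-up 28x28 features (TF2 eager style):

import numpy as np
import tensorflow as tf

features = np.random.rand(100, 28, 28).astype(np.float32)  # made-up data
dataset = tf.data.Dataset.from_tensor_slices(features)
dataset = dataset.batch(32, drop_remainder=True)  # fixed-size batches only

for batch in dataset.take(1):
    batch = tf.reshape(batch, [32, 28, 28, 1])  # pin the static shape explicitly
    print(batch.shape)  # (32, 28, 28, 1), no None dimensions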

A Recipe for Training Neural Networks by Andrej Karpathy is a great article on training/debugging neural networks.

Bigham answered 21/8, 2020 at 17:15 Comment(0)

The current implementation of tfdbg.has_inf_or_nan does not seem to break immediately upon hitting a tensor containing a NaN. When it does stop, the huge list of tensors displayed is not sorted by execution order. A possible hack to find the first appearance of NaNs is to dump all tensors to a temporary directory and inspect them afterwards. Here is a quick-and-dirty example of doing that (assuming the NaNs appear in the first few runs).
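
One way such a dump could be sketched (not necessarily the original snippet), using tf_debug's dumping wrapper and a toy op that deliberately produces NaN values:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

x = tf.constant([1.0, 0.0])
z = tf.log(x) / tf.log(x)  # deliberately produces NaN values

sess = tf.Session()
# Dump every intermediate tensor of each sess.run call into a run_* subdirectory
# under /tmp/tfdbg_dumps (hypothetical path), to be inspected afterwards, e.g.
# with tf_debug.DebugDumpDir or the tfdbg offline_analyzer CLI.
sess = tf_debug.DumpingDebugWrapperSession(sess, "/tmp/tfdbg_dumps")
sess.run(z)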

Dandiprat answered 15/5, 2018 at 13:16 Comment(0)

I was able to fix my NaN issues by getting rid of all of the dropout layers in the network model. I suspected that for some reason a unit (neuron?) in the network lost too many input connections (so it had zero after dropout), so when information was fed through, it had a value of NaN. I don't see how that could happen over and over again with dropout = 0.8 on layers with more than a hundred units each, so the problem was probably fixed for a different reason. Either way, commenting out the dropout layers fixed my issue.

EDIT: Oops! I realized that I had added a dropout layer after my final output layer, which consists of three units. Now that makes more sense. So, don't do that!

Mohammadmohammed answered 6/1, 2019 at 13:1 Comment(0)
