Nan in summary histogram

My program sometimes hits this error (not on every run). When it does, I can always reproduce it by loading the last model I saved before the program crashed due to NaN. Rerunning from that checkpoint, the first training step seems fine when using the model to compute the loss (I printed the loss and it shows no problem), but after applying gradients the values of the embedding variables turn to NaN.

So what is the root cause of the NaN problem? I'm confused about how to debug this further, since the same program with the same data and parameters mostly runs fine and only hits this problem on some runs.

Loading existing model from: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000
Train from restored model: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5235 get requests, put_count=4729 evicted_count=1000 eviction_rate=0.211461 and unsatisfied allocation rate=0.306781
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110
2016-10-04 21:45:39 epoch:1.87 train_step:18001 duration:0.947 elapsed:0.947 train_avg_metrics:['loss:0.527']  ['loss:0.527']
2016-10-04 21:45:39 epoch:1.87 eval_step: 18001 duration:0.001 elapsed:0.948 ratio:0.001
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
Traceback (most recent call last):
  File "./train.py", line 308, in <module>
    tf.app.run()
Rockfish answered 4/10, 2016 at 14:3 Comment(0)

Sometimes, during the initial iterations of training, the model may spew out only a single prediction class. If, by random chance, the predicted probability for a class comes out as exactly 0 for the training examples, the categorical cross-entropy loss can become NaN (it involves log(0)).

Make sure that you introduce a small value when computing the loss, such as tf.log(predictions + 1e-8). This will help in overcoming the numerical instability.
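
As a concrete illustration, here is a minimal sketch of that epsilon trick, assuming one-hot labels and softmax probabilities; the function name and the 1e-8 value are just illustrative choices, and tf.math.log is the newer spelling of tf.log:

import tensorflow as tf

def stable_categorical_crossentropy(labels, probs, epsilon=1e-8):
    # Clip probabilities away from exactly 0 so the log never produces -inf/NaN.
    probs = tf.clip_by_value(probs, epsilon, 1.0)
    # Standard categorical cross-entropy over one-hot labels.
    return -tf.reduce_mean(tf.reduce_sum(labels * tf.math.log(probs), axis=-1))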

Exaggerate answered 20/1, 2018 at 11:27 Comment(4)
Super useful! Thanks a lot! When you have a dataset with very sparse positive examples, you have to deal with minibatches having no positive examples despite shuffling... that solves it! - Spikelet
Or use tf.nn.softmax_cross_entropy_with_logits, which ensures numerical stability. - Visualize
How exactly do I introduce a small value when computing the loss when the loss function is built in, i.e. model.compile(loss='binary_crossentropy', optimizer=opt)? - Kelleekelleher
@najeeb khan, can you please have a look at #53080789 - Kelleekelleher

Usually NaN is a sign of model instability, for example exploding gradients. It may go unnoticed; the loss would just stop shrinking. Trying to log a weights summary makes the problem explicit. I suggest reducing the learning rate as a first measure. If that doesn't help, post your code here; without seeing it, it's hard to suggest anything more specific.
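
For example, with a Keras optimizer, this is a rough sketch of that first measure; the exact values are arbitrary, and clipnorm is an additional commonly used guard against exploding gradients rather than something the answer prescribes:

import tensorflow as tf

# Lower learning rate (value is arbitrary); clipnorm additionally caps gradient norms.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)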

Morphia answered 11/10, 2016 at 7:49 Comment(0)

I got a similar error and tried different learning rates, batch sizes, loss functions, and model architectures without any luck. But then I noticed that I can train my model just fine if I'm not using the TensorBoard callback. It looks like "Nan in summary histogram" refers to saving the model weights histogram, which somehow makes those NaNs explicit.

Turning off histograms in the TensorBoard callback solved the issue for me:

tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=0)
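
For context, a minimal sketch of wiring this into training, assuming a compiled Keras model named model and training arrays x_train, y_train (both hypothetical), plus a made-up log directory:

import tensorflow as tf

log_dir = "logs/no_histograms"  # hypothetical location
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=0)

# histogram_freq=0 stops the callback from writing weight histograms,
# so the summary writer never touches the (possibly NaN) weights; scalar logs are still written.
model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_cb])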
Fanaticism answered 8/11, 2019 at 19:19 Comment(0)

I had a similar problem and, in my case, I changed the activation from tf.nn.relu to tf.nn.sigmoid and it worked. I hope this helps.
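
In Keras terms that swap would look something like the following; the layer and its width are hypothetical, only the activation change is the point:

import tensorflow as tf

# Previously: tf.keras.layers.Dense(128, activation=tf.nn.relu)
hidden = tf.keras.layers.Dense(128, activation=tf.nn.sigmoid)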

Lashondalashonde answered 16/12, 2019 at 12:56 Comment(0)

I believe it has something to do with your system running out of memory. This especially seems to be the problem if you get the error after a certain number of steps.

Setting train to false in batch_norm (within your pipeline.config file) appears to overcome this problem.

It should look something like this:

batch_norm {
  decay: 0.999
  center: true
  scale: true
  epsilon: 0.001
  train: false
}

Delete the training directory (logdir) and start training it again. Resuming from a recent checkpoint will result in the same error.

Hope this helped.

Kell answered 14/12, 2019 at 17:7 Comment(0)

If you are using tensorflow.keras.layers.Masking and one or more input features happen to be masked for all inputs in a batch, then you can get this error.

Similar to najeeb khan's case, but triggered differently.

This makes sense because, when TensorFlow calls _log_weights from on_epoch_end, some weights related to the input features that were always masked are still NaN.

For me, the solution was to explicitly load the weights (via tensorflow.keras.models.load_model).
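
That is, rebuild the model from the saved file rather than reusing the in-memory one; the path below is hypothetical:

from tensorflow.keras.models import load_model

# Restores the full model (architecture + weights) from disk.
model = load_model("path/to/saved_model.h5")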

Plaything answered 30/5, 2020 at 22:46 Comment(0)

It happened to me with RaggedTensors. I used tf.concat to split and concatenate a multidimensional ragged tensor into a flat one, i.e. (None, 6, 7) -> (None, 42), and started getting the error. Disabling histograms in the TensorBoard callback also fixed it in my case.

Special answered 4/1, 2022 at 15:25 Comment(0)
