Softmax Cross Entropy loss explodes

I am creating a deep convolutional neural network for pixel-wise classification. I am using the Adam optimizer and softmax with cross entropy.

Github Repository

I asked a similar question found here, but the answer I was given did not solve the problem. I also have a more detailed graph of what is going wrong. Whenever I use softmax, the problem shown in the graph occurs. I have tried many things, such as adjusting the learning rate and epsilon, trying different optimizers, etc. The loss never decreases past 500. I do not shuffle my data at the moment. Using sigmoid in place of softmax keeps this problem from occurring; however, my problem has multiple classes, so the accuracy with sigmoid is not very good. It should also be mentioned that when the loss is low, my accuracy is only about 80%, and I need much better than this. Why would my loss suddenly spike like this?

x = tf.placeholder(tf.float32, shape=[None, 7168])       # flattened input, 7168 pixels
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])   # one-hot labels, 3 classes per pixel

# Many convolutions and ReLUs omitted

final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168, 7168, 3])                # weight_variable/bias_variable are helpers defined in the repo
b_final = bias_variable([7168, 3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final   # logits, shape [batch, 7168, 3]

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Svensen answered 27/2, 2018 at 19:43 Comment(9)
Use a sigmoid layer and after the sigmoid use a softmax layer... That's what I do... And it works for me with good results.... I won't be answering this because in the previous question I have given enough details... go back and refer to my answer to your previous question... And think deeply about my answerDowdy
When I tried that, the loss started out at ~1.3 and never decreased any further. Have you ever experienced that?Svensen
Show me your code... your code and not someone else's code, and then I can help you betterDowdy
I have added the relevant code to the question. The rest of the code is in the github repository linked above. I will add a bounty to this question if you help me solve the problem.Svensen
I just tried using the sigmoid before softmax layer again. The loss starts at about 1.13 and does not decrease. After several epochs, the training loss becomes nan.Svensen
I have a few questions... Are you doing predictions with this architecture? Did you check your regenerated images with the individual deconv layers? How is the addition of the individual deconv layers happening... Is the addition done by another layer or just a simple addition of tensors? In the code I see a simple addition of all of them by tf.add... Is this true?Dowdy
Yes, the architecture pictured in the repository is the architecture I am making predictions from. I just use tf.add to combine the deconvolution layers. Is this not correct?Svensen
I am suspicious of the addition operation (actually not sure)... The architecture looks good.... Can you try applying relu at all the final deconv layers that you are adding, and then softmax... try that... even I am not sure what's wrong by just looking at the code, because the code looks good. Also use the tf.clip_by_* ops for clipping the gradients... experiment with those deconv layers...Dowdy
@Dowdy the sigmoid layer loses the dynamic range of the logits. I don't see why that's necessary. Any reference?Amylaceous

You need label smoothing.

I just had the same problem. I was training with tf.nn.sparse_softmax_cross_entropy_with_logits, which is equivalent to using tf.nn.softmax_cross_entropy_with_logits with one-hot labels. My dataset predicts the occurrence of rare events, so the labels in the training set are 99% class 0 and 1% class 1. My loss would start to fall, then stagnate (with reasonable predictions), then suddenly explode, and then the predictions went bad too.
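As a small self-contained illustration of that equivalence (toy shapes, not my actual training code): the sparse op with integer class ids gives the same per-sample losses as the dense op with one-hot labels.

import tensorflow as tf  # TF 1.x, same API style as in the question

# Toy example: 2 samples, 3 classes (shapes are illustrative only)
logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 5.0]])
sparse_labels = tf.constant([0, 2])                    # integer class ids
onehot_labels = tf.one_hot(sparse_labels, depth=3)     # the same labels, one-hot encoded

sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sparse_labels, logits=logits)
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=onehot_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(sparse_loss))   # per-sample losses
    print(sess.run(dense_loss))    # identical values (up to float precision)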

Using the tf.summary ops to log the internal network state to TensorBoard, I observed that the logits were growing and growing in absolute value. Eventually, at >1e8, tf.nn.softmax_cross_entropy_with_logits became numerically unstable, and that is what generated those weird loss spikes.
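That logging looks roughly like this (a sketch using the final_conv logits from the question; the log directory and the training-loop feed values are placeholders):

# Track how large the logits get (tensor names taken from the question)
tf.summary.histogram('final_conv_logits', final_conv)
tf.summary.scalar('max_abs_logit', tf.reduce_max(tf.abs(final_conv)))
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('./logs', tf.get_default_graph())

# Inside the training loop (feed_dict as usual):
# summary, _ = sess.run([merged, train_step],
#                       feed_dict={x: batch_x, y_: batch_y, keep_prob: 0.5})
# writer.add_summary(summary, step)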

In my opinion, the reason this happens lies in the softmax function itself, which is in line with Jai's comment that putting a sigmoid before the softmax will fix things. That will almost certainly also keep the softmax likelihoods from being accurate, since it limits the value range of the logits, but in doing so it prevents the overflow.
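For illustration, the sigmoid-before-softmax variant would look something like this (a sketch reusing the names from the question, not something I recommend, for the accuracy reason above):

# Squash the logits to (0, 1) before the softmax cross-entropy:
# exp() of a value in (0, 1) can never overflow, but with 3 classes
# no class probability can get above e / (e + 2) ≈ 0.58 either.
squashed_logits = tf.nn.sigmoid(final_conv)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=squashed_logits))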

Softmax is defined as likelihood[i] = tf.exp(logit[i]) / tf.reduce_sum(tf.exp(logit)), where the sum in the denominator runs over all classes. Cross-entropy is defined as tf.reduce_sum(-label_likelihood[i] * tf.log(likelihood[i])), so if your labels are one-hot, that reduces to just the negative logarithm of your target likelihood. In practice, that means you're pushing likelihood[true_class] as close to 1.0 as you can, and due to the softmax, the only way to do that is for tf.exp(logit[!=true_class] - logit[true_class]) to become as close to 0.0 as possible.

So in effect, you have asked the optimizer to produce tf.exp(x) == 0.0, and the only way to do that is by making x == -infinity, i.e. by driving the logit differences off towards infinity. And that's why you get numerical instability.
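A tiny standalone illustration of that blow-up (plain NumPy, nothing to do with the network in the question): with moderate logits the textbook formula is fine, with huge logits exp() overflows and the loss turns into nan.

import numpy as np

def naive_softmax_xent(logits, onehot):
    # Textbook softmax + cross-entropy, without the usual max-subtraction trick
    exps = np.exp(logits)                 # overflows to inf once logits are large enough
    probs = exps / exps.sum()
    return -np.sum(onehot * np.log(probs))

onehot = np.array([0.0, 0.0, 1.0])
print(naive_softmax_xent(np.array([1.0, 2.0, 3.0]), onehot))    # ~0.41, fine
print(naive_softmax_xent(np.array([1e3, 2e3, 3e3]), onehot))    # nan (NumPy also warns about the overflow)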

The solution is to "blur" the labels, so instead of [0, 0, 1] you use [0.01, 0.01, 0.98]. Now the optimizer only works towards tf.exp(x) == 0.01, which corresponds to x == -4.6, safely inside the numerical range where GPU calculations are accurate and reliable.
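In TF1 terms that can look like either of the following (a sketch reusing y_ and final_conv from the question; the smoothing factor 0.03 just reproduces the [0.01, 0.01, 0.98] example above). Note that tf.losses.softmax_cross_entropy is documented for [batch, num_classes] labels, so with the pixel-wise [batch, 7168, 3] labels from the question the manual variant may be the safer fit.

# Option 1: let TF apply the smoothing for you
cross_entropy = tf.losses.softmax_cross_entropy(
    onehot_labels=y_, logits=final_conv, label_smoothing=0.03)

# Option 2: smooth the one-hot labels by hand and keep the original loss op
num_classes = 3
smoothing = 0.03
y_smooth = y_ * (1.0 - smoothing) + smoothing / num_classes   # [0, 0, 1] -> [0.01, 0.01, 0.98]
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_smooth, logits=final_conv))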

Hush answered 10/12, 2021 at 8:44 Comment(0)

Not sure what causes it exactly. I had the same issue a few times. A few things generally help: you might reduce the learning rate, i.e. the bound on the learning rate for Adam (e.g. from 1e-5 down to 1e-7 or so), or try stochastic gradient descent. Adam tries to estimate learning rates itself, which can lead to unstable training: see Adam optimizer goes haywire after 200k batches, training loss grows
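In TF1 terms those two options look roughly like this (a sketch; the values are just examples and cross_entropy is the loss op from the question):

# Lower Adam's learning rate (optionally also raise epsilon for extra stability)
train_step = tf.train.AdamOptimizer(learning_rate=1e-7, epsilon=1e-4).minimize(cross_entropy)

# ...or fall back to plain stochastic gradient descent
train_step = tf.train.GradientDescentOptimizer(learning_rate=1e-5).minimize(cross_entropy)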

Once I also removed batch norm and that actually helped, but that was for a "specially" designed network for stroke data (= point sequences) which was not very deep and used Conv1d layers.

Allard answered 7/6, 2020 at 19:13 Comment(0)
