I'm using TensorFlow and I modified the tutorial example to take my RGB images.
The algorithm works flawlessly out of the box on the new image set, until suddenly (still converging, it's around 92% accuracy usually), it crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until very suddenly, for unknown reason, the error is thrown. Adding
print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())
as debug code to each loop, yields the following output:
Step 8600
max W vales: 0.759422 0.295087 0.344725 0.583884
max b vales: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W vales: 0.75947 0.295084 0.344723 0.583893
max b vales: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W vales: 0.759521 0.295101 0.34472 0.5839
max b vales: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
Since none of my values is very high, the only way a NaN can happen is by a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no other explanation than that this comes from the internal TF code.
I'm clueless on what to do with this. Any suggestions? The algorithm is converging nicely, its accuracy on my validation set was steadily climbing and just reached 92.5% at iteration 8600.