I ran into exactly the same problem: the gradients diverged and I got a bunch of nan values for the predicted y. I implemented what nessuno suggested, but unfortunately the diverging gradients were still not fixed.
Instead I tried sigmoid as the activation function for layer 1, and it worked! But relu did not work when W1 and W2 were initialized as zero matrices; the accuracy was only 0.1135. To make both relu and sigmoid work, it is better to randomize the initialization of W1 and W2. Here's the modified code:
import tensorflow as tf
x = tf.placeholder(tf.float32, [None, 784])
# layer 1
with tf.variable_scope('layer1'):
    W1 = tf.get_variable('w1', [784, 200],
                         initializer=tf.random_normal_initializer())
    b1 = tf.get_variable('b1', [200],  # one bias per hidden unit
                         initializer=tf.constant_initializer(0.0))
    y1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
    # y1 = tf.nn.relu(tf.matmul(x, W1) + b1)  # alternative choice for activation

# layer 2
with tf.variable_scope('layer2'):
    W2 = tf.get_variable('w2', [200, 10],
                         initializer=tf.random_normal_initializer())
    b2 = tf.get_variable('b2', [10],  # one bias per output class
                         initializer=tf.constant_initializer(0.0))
    y2 = tf.nn.softmax(tf.matmul(y1, W2) + b2)
# output
y = y2
y_ = tf.placeholder(tf.float32, [None, 10])
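For completeness, here is a minimal sketch of how the rest of the MNIST tutorial pipeline could be wired up to this graph. The loss, train step, and accuracy names are my own, and clipping the softmax output before the log is just one way to keep a 0 probability from producing nan; treat it as an illustration, not as the exact training code behind the 0.1135 number above.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# clip the softmax output so log(0) can never produce nan
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), axis=1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))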
I found this link helpful; see question 2 part (c), which gives the backpropagation derivatives for a basic 2-layer neural network. In my opinion, when users don't specify any activation function and just apply a linear flow in layer 1, backpropagation ends up with a gradient that looks something like (something) * W2^T * W1^T, and since we initialize both W1 and W2 to zeros, their product is zero, which results in vanishing gradients.
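As a quick illustrative check (a toy sketch of my own, not part of the original code; the small graph and variable names here are hypothetical), you can ask TensorFlow for the gradient of a purely linear 2-layer network with zero-initialized weights and see that the gradient with respect to W1 comes back as all zeros:

import numpy as np
import tensorflow as tf

tf.reset_default_graph()
x = tf.placeholder(tf.float32, [None, 4])
W1 = tf.get_variable('W1', [4, 3], initializer=tf.zeros_initializer())
W2 = tf.get_variable('W2', [3, 2], initializer=tf.zeros_initializer())
out = tf.matmul(tf.matmul(x, W1), W2)   # no activation: purely linear flow
loss = tf.reduce_sum(out)

grad_W1 = tf.gradients(loss, W1)[0]     # gradient flowing back through W2^T

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_W1, feed_dict={x: np.ones((5, 4), np.float32)}))
    # prints an all-zero matrix, so W1 never receives a useful update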
UPDATE
This is from the Quora answer Ofir posted about good initial weights in a neural network.
The most common initializations are random initialization and Xavier
initialization. Random initialization just samples each weight from a
standard distribution (often a normal distribution) with low
deviation. The low deviation allows you to bias the network towards
the 'simple' 0 solution, without the bad repercussions of actually
initializing the weights to 0.
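For reference (my own addition, not part of the quoted answer), both of those options can be plugged straight into the get_variable calls above, e.g. a low-stddev normal or TensorFlow's built-in Xavier (Glorot) initializer:

import tensorflow as tf

# random initialization with a low standard deviation
W1 = tf.get_variable('w1', [784, 200],
                     initializer=tf.random_normal_initializer(stddev=0.1))

# Xavier (Glorot) initialization, which scales the variance by the layer sizes
W1_xavier = tf.get_variable('w1_xavier', [784, 200],
                            initializer=tf.contrib.layers.xavier_initializer())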