Intermediate layer makes tensorflow optimizer to stop working

Asked 15/2, 2018 at 21:49 Answered 1/4, 2020 at 2:39

Solved python tensorflow machine-learning deep-learning autoencoder

This graph trains a simple signal identity encoder, and in fact shows that the weights are being evolved by the optimizer:

import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output

#W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
#b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
#O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O, W2) + b2

O2_0 = tf.gather_nd(O2, [[0,0]])

estimate0 = 2.0*O2_0

eval_inp = tf.gather_nd(I,[[0,0]])
k = 1e-5
L = 5.0
distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, #W1, b1,
  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]

train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init_op)
  for i in range(10000):
    print sess.run([train_op, I, W, distance], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0})
  for i in range(10):
    print sess.run([eval_inp, W, estimate0], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0})

However, when I uncomment the intermediate hidden layer and train the resulting network, I see that the weights are not evolving anymore:

import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output

W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O1, W2) + b2

O2_0 = tf.gather_nd(O2, [[0,0]])

estimate0 = 2.0*O2_0

eval_inp = tf.gather_nd(I,[[0,0]])

distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, W1, b1,
  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]

train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init_op)
  for i in range(10000):
    print sess.run([train_op, I, W, distance], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0})
  for i in range(10):
    print sess.run([eval_inp, W, estimate0], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0})

The evaluation of estimate0 converging quickly in some fixed value that becomes independient from the input signal. I have no idea why this is happening

Question:

Any idea what might be wrong with the second example?

Disputant answered 15/2, 2018 at 21:49 Comment(3)

How are you verifying the weights are "evolving"? – Ti 21/2, 2018 at 22:42

@EvanWeissburg in the second example W values barely change, distance does not get smaller and in the inference loop estimate0 barely changes value with different inputs. In first example W change, distance become of the order of 1e-5 in a hundred steps and estimate0 closely tracks the input value – Disputant 21/2, 2018 at 23:1

The answer below is very good. Another hint: try some other optimizer like Adam instead of plain Gradient Descent. You could even try another activation function like leaky relu for example. – Euhemerism 28/2, 2018 at 12:50

TL;DR: the deeper the neural network becomes, the more you should pay attention to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variables initialization.

Problem analysis

I've added tensorboard summaries for the variables and gradients into both of your scripts and got the following:

2-layer network

3-layer network

The charts show the distributions of W:0 variable (the first layer) and how they are changed from 0 epoch to 1000 (clickable). Indeed, we can see, the rate of change is much higher in a 2-layer network. But I'd like to pay attention to the gradient distribution, which is much closer to 0 in a 3-layer network (first variance is around 0.005, the second one is around 0.000002, i.e. 1000 times smaller). This is the vanishing gradient problem.

Here's the helper code if you're interested:

for g, v in grads_and_vars:
  tf.summary.histogram(v.name, v)
  tf.summary.histogram(v.name + '_grad', g)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
  writer.add_summary(summary, global_step=i)

Solution

All deep networks suffer from this to some extent and there is no universal solution that will auto-magically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.

I replaced your normal initialization with:

W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)

There are lots of tutorials on Xavier init, you can take a look at this one, for example. Note that I set the bias init to be slightly positive to make sure that ReLu outputs are positive for the most of neurons, at least in the beginning.

This changed the picture immediately:

The weights are still not moving quite as fast as before, but they are moving (note the scale of W:0 values) and the gradients distribution became much less peaked at 0, thus much better.

Of course, it's not the end. To improve it further, you should implement the full autoencoder, because currently the loss is affected by the [0,0] element reconstruction, so most outputs aren't used in optimization. You can also play with different optimizers (Adam would be my choice) and the learning rates.

Rosenbaum answered 22/2, 2018 at 13:2 Comment(9)

this is why i use keras and not tensorflow directly - sensible defaults – Aguedaaguero 27/2, 2018 at 20:22

thank you for this response, it sent me on the right track – Disputant 28/2, 2018 at 16:21

What do yo mean by that @denfromufa. What are sensible defaults in tensorflow? You always have to set the initializer and stuff like that yourself and choose the right optimizer. – Reshape 21/3, 2018 at 9:29

@Rosenbaum I cannot really see the difference between your result after xavier initialization and before. The weights seem to be the same whereas the gradient changes a tiny bit. But where is the big difference? – Reshape 21/3, 2018 at 11:43

@thigi pay attention to the variance of grad distribution. It jumped from ~0.000002 to ~0.1. That is more than enough for NN to learn – Rosenbaum 21/3, 2018 at 11:52

okay, so the peak itself is okay. I get it! What about the kernel weights that seem to stay the same? they should change in order to make it "better" --> lower loss – Reshape 21/3, 2018 at 11:55

If the grad hasn't collapsed to 0, how can they stay the same. It's ok if the changes are not dramatic, you can always boost up the learning rate a bit if you'd like. They are changing as long as the grads haven't vanished and that's fine. – Rosenbaum 21/3, 2018 at 11:58

Okay, and what is a practical lower bound for the gradients? I have some gradients with variance of 0.001. Is that already vanished? – Reshape 21/3, 2018 at 11:59

0.001 looks rather small to me, especially if you feel that the NN has a long way to learn. I think you should dig further and see why exactly they are so small. Activation? Dying relu? Add extra batch norm? – Rosenbaum 21/3, 2018 at 12:4

That looks very exciting. Where exactly does this code belong? I've only recently discovered TensorBoard

is this in callbacks somehow:

  for g, v in grads_and_vars:
  tf.summary.histogram(v.name, v)
  tf.summary.histogram(v.name + '_grad', g)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

is this after fiting:

_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
  writer.add_summary(summary, global_step=i)

Damson answered 1/4, 2020 at 2:39 Comment(0)

Problem analysis

Solution

Recommended topics

Hot tags