I ran into exactly the same problem: the gradients diverged and I got a bunch of nan values for the predicted y. I implemented what nessuno suggested, but unfortunately the diverging gradients were still not fixed.
Instead I tried sigmoid as the activation function for layer 1, and it worked! But relu did not work when W1 and W2 were initialized as zero matrices; the accuracy was only 0.1135. To make both relu and sigmoid work, it is better to randomize the initialization of W1 and W2. Here's the modified code:
import tensorflow as tf
x = tf.placeholder(tf.float32, [None, 784])
# layer 1
with tf.variable_scope('layer1'):
    W1 = tf.get_variable('w1', [784, 200],
                         initializer=tf.random_normal_initializer())
    b1 = tf.get_variable('b1', [200],  # one bias per hidden unit
                         initializer=tf.constant_initializer(0.0))
    y1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
    # y1 = tf.nn.relu(tf.matmul(x, W1) + b1)  # alternative choice for activation

# layer 2
with tf.variable_scope('layer2'):
    W2 = tf.get_variable('w2', [200, 10],
                         initializer=tf.random_normal_initializer())
    b2 = tf.get_variable('b2', [10],  # one bias per output class
                         initializer=tf.constant_initializer(0.0))
    y2 = tf.nn.softmax(tf.matmul(y1, W2) + b2)
# output
y = y2
y_ = tf.placeholder(tf.float32, [None, 10])
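For completeness, here is a minimal sketch of how the rest of the MNIST tutorial pipeline could be wired up to this graph. The loss, train step, and accuracy names are my own, and clipping the softmax output before the log is just one way to keep a 0 probability from producing nan; treat it as an illustration, not as the exact training code behind the 0.1135 number above.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# clip the softmax output so log(0) can never produce nan
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), axis=1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))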
I found this link helpful; see question 2 part (c), which gives the backpropagation derivatives for a basic 2-layer neural network. In my opinion, when users don't specify any activation function and just apply a linear flow in layer 1, backpropagation ends up with a gradient that looks something like (something) * W2^T * W1^T, and since we initialize both W1 and W2 to zeros, their product is zero, which results in vanishing gradients.
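As a quick illustrative check (a toy sketch of my own, not part of the original code; the small graph and variable names here are hypothetical), you can ask TensorFlow for the gradient of a purely linear 2-layer network with zero-initialized weights and see that the gradient with respect to W1 comes back as all zeros:

import numpy as np
import tensorflow as tf

tf.reset_default_graph()
x = tf.placeholder(tf.float32, [None, 4])
W1 = tf.get_variable('W1', [4, 3], initializer=tf.zeros_initializer())
W2 = tf.get_variable('W2', [3, 2], initializer=tf.zeros_initializer())
out = tf.matmul(tf.matmul(x, W1), W2)   # no activation: purely linear flow
loss = tf.reduce_sum(out)

grad_W1 = tf.gradients(loss, W1)[0]     # gradient flowing back through W2^T

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_W1, feed_dict={x: np.ones((5, 4), np.float32)}))
    # prints an all-zero matrix, so W1 never receives a useful update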
UPDATE
This is from the Quora answer Ofir posted about good initial weights in a neural network.
The most common initializations are random initialization and Xavier
initialization. Random initialization just samples each weight from a
standard distribution (often a normal distribution) with low
deviation. The low deviation allows you to bias the network towards
the 'simple' 0 solution, without the bad repercussions of actually
initializing the weights to 0.
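For reference (my own addition, not part of the quoted answer), both of those options can be plugged straight into the get_variable calls above, e.g. a low-stddev normal or TensorFlow's built-in Xavier (Glorot) initializer:

import tensorflow as tf

# random initialization with a low standard deviation
W1 = tf.get_variable('w1', [784, 200],
                     initializer=tf.random_normal_initializer(stddev=0.1))

# Xavier (Glorot) initialization, which scales the variance by the layer sizes
W1_xavier = tf.get_variable('w1_xavier', [784, 200],
                            initializer=tf.contrib.layers.xavier_initializer())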