TensorFlow MLP not training XOR

I've built an MLP with Google's TensorFlow library. The network runs, but somehow it refuses to learn properly: it always converges to an output of nearly 1.0, no matter what the input actually is.

The complete code can be seen here.

Any ideas?


The input and output (batch size 4) are as follows:

input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]  # XOR input
output_data = [[0.], [1.], [1.], [0.]]  # XOR output

n_input = tf.placeholder(tf.float32, shape=[None, 2], name="n_input")
n_output = tf.placeholder(tf.float32, shape=[None, 1], name="n_output")

Hidden layer configuration:

# hidden layer's bias neuron
b_hidden = tf.Variable(0.1, name="hidden_bias")

# hidden layer's weight matrix initialized with a uniform distribution
W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0), name="hidden_weights")

# calc hidden layer's activation
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)

Output layer configuration:

W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0), name="output_weights")  # output layer's weight matrix
output = tf.sigmoid(tf.matmul(hidden, W_output))  # calc output layer's activation

My learning setup looks like this:

loss = tf.reduce_mean(cross_entropy)  # mean the cross_entropy
optimizer = tf.train.GradientDescentOptimizer(0.01)  # take a gradient descent for optimizing
train = optimizer.minimize(loss)  # let the optimizer train

I tried both setups for cross entropy:

cross_entropy = -tf.reduce_sum(n_output * tf.log(output))

and

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(n_output, output)

where n_output is the target output as given in output_data, and output is the value predicted/calculated by my network.
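For reference, here is a minimal sketch of the two-term binary cross-entropy (the form my solution below ends up using), reusing the placeholders defined above. The first formulation only penalizes the examples whose target is 1, so nothing discourages the network from outputting ~1 for every input. The logits name is introduced here purely for illustration, and I believe sigmoid_cross_entropy_with_logits expects the pre-sigmoid logits rather than the sigmoid output:

# full binary cross-entropy: also penalizes the examples whose target is 0
cross_entropy = -tf.reduce_mean(n_output * tf.log(output)
                                + (1. - n_output) * tf.log(1. - output))

# or, using the built-in op on the *pre-sigmoid* logits (illustrative sketch)
logits = tf.matmul(hidden, W_output)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits, n_output)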


The training inside the for-loop (for n epochs) goes like this:

cvalues = sess.run([train, loss, W_hidden, b_hidden, W_output],
                   feed_dict={n_input: input_data, n_output: output_data})

I am saving the outcome to cvalues for debug printing of loss, W_hidden, ...


No matter what I've tried, when I test my network and try to validate the output, it always produces something like this:

(...)
step: 2000
loss: 0.0137040186673
b_hidden: 1.3272010088
W_hidden: [[ 0.23195425  0.53248233 -0.21644847 -0.54775208  0.52298909]
 [ 0.73933059  0.51440752 -0.08397482 -0.62724304 -0.53347367]]
W_output: [[ 1.65939867]
 [ 0.78912479]
 [ 1.4831928 ]
 [ 1.28612828]
 [ 1.12486529]]

(--- finished with 2000 epochs ---)

(Test input for validation:)

input: [0.0, 0.0] | output: [[ 0.99339396]]
input: [0.0, 1.0] | output: [[ 0.99289012]]
input: [1.0, 0.0] | output: [[ 0.99346077]]
input: [1.0, 1.0] | output: [[ 0.99261558]]

So it is not learning properly; it always converges to nearly 1.0 no matter which input is fed.

Skull answered 30/11, 2015 at 11:42 Comment(2)
Your b_hidden variable is a scalar - is that intentional? I think you should create it as b_hidden = tf.Variable(tf.constant(0.1, shape=[hidden_nodes]), name="hidden_bias"), which might help. Another thing to try would be adding a b_output bias term to your output layer.Yoshi
Thank you for the comment. Indeed, I just failed to notice that b_hidden should also be a vector and not a scalar... However, the network still converges to nearly 1.0 for every input, with or without a hidden bias, as a scalar or a vector, and with or without a bias for the output layer. I really think I am missing some error in the learning method or network architecture :/Skull
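(For concreteness, the biases suggested in the comments above would look something like the following sketch, reusing the names from the question; b_output is a hypothetical name introduced here purely for illustration, not part of the original code:)

# vector bias for the hidden layer, as suggested above
b_hidden = tf.Variable(tf.constant(0.1, shape=[hidden_nodes]), name="hidden_bias")
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)

# additional bias for the output layer (b_output is hypothetical)
b_output = tf.Variable(tf.constant(0.1, shape=[1]), name="output_bias")
output = tf.sigmoid(tf.matmul(hidden, W_output) + b_output)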

In the meantime, with the help of a colleague, I was able to fix my solution and want to post it for completeness. My solution works with cross entropy and without altering the training data. Additionally, it has the desired input shape of (1, 2), and the output is a scalar.

It makes use of an AdamOptimizer, which decreases the error much faster than a GradientDescentOptimizer. See this post for more information (& questions^^) about the optimizer.

In fact, my network produces reasonably good results in only 400-800 learning steps.

After 2000 learning steps the output is nearly "perfect":

step: 2000
loss: 0.00103311243281

input: [0.0, 0.0] | output: [[ 0.00019799]]
input: [0.0, 1.0] | output: [[ 0.99979786]]
input: [1.0, 0.0] | output: [[ 0.99996307]]
input: [1.0, 1.0] | output: [[ 0.00033751]]

import tensorflow as tf    

#####################
# preparation stuff #
#####################

# define input and output data
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]  # XOR input
output_data = [[0.], [1.], [1.], [0.]]  # XOR output

# create a placeholder for the input
# None indicates a variable batch size for the input
# one input's dimension is [1, 2] and output's [1, 1]
n_input = tf.placeholder(tf.float32, shape=[None, 2], name="n_input")
n_output = tf.placeholder(tf.float32, shape=[None, 1], name="n_output")

# number of neurons in the hidden layer
hidden_nodes = 5


################
# hidden layer #
################

# hidden layer's bias vector
b_hidden = tf.Variable(tf.random_normal([hidden_nodes]), name="hidden_bias")

# hidden layer's weight matrix initialized with a normal distribution
W_hidden = tf.Variable(tf.random_normal([2, hidden_nodes]), name="hidden_weights")

# calc hidden layer's activation
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)


################
# output layer #
################

W_output = tf.Variable(tf.random_normal([hidden_nodes, 1]), name="output_weights")  # output layer's weight matrix
output = tf.sigmoid(tf.matmul(hidden, W_output))  # calc output layer's activation


############
# learning #
############
cross_entropy = -(n_output * tf.log(output) + (1 - n_output) * tf.log(1 - output))
# cross_entropy = tf.square(n_output - output)  # simpler, but also works

loss = tf.reduce_mean(cross_entropy)  # mean the cross_entropy
optimizer = tf.train.AdamOptimizer(0.01)  # Adam optimizer with a learning rate of 0.01
train = optimizer.minimize(loss)  # let the optimizer train


####################
# initialize graph #
####################
init = tf.initialize_all_variables()

sess = tf.Session()  # create the session and therefore the graph
sess.run(init)  # initialize all variables  

#####################
# train the network #
#####################
for epoch in xrange(0, 2001):
    # run the training operation
    cvalues = sess.run([train, loss, W_hidden, b_hidden, W_output],
                       feed_dict={n_input: input_data, n_output: output_data})

    # print some debug stuff
    if epoch % 200 == 0:
        print("")
        print("step: {:>3}".format(epoch))
        print("loss: {}".format(cvalues[1]))
        # print("b_hidden: {}".format(cvalues[3]))
        # print("W_hidden: {}".format(cvalues[2]))
        # print("W_output: {}".format(cvalues[4]))


print("")
print("input: {} | output: {}".format(input_data[0], sess.run(output, feed_dict={n_input: [input_data[0]]})))
print("input: {} | output: {}".format(input_data[1], sess.run(output, feed_dict={n_input: [input_data[1]]})))
print("input: {} | output: {}".format(input_data[2], sess.run(output, feed_dict={n_input: [input_data[2]]})))
print("input: {} | output: {}".format(input_data[3], sess.run(output, feed_dict={n_input: [input_data[3]]})))
Skull answered 1/12, 2015 at 14:28 Comment(0)

I can't comment because I don't have enough reputation, but I have some questions on that answer, mrry. The $L_2$ loss function makes sense because it is basically the MSE function, but why wouldn't cross-entropy work? It certainly works for other NN libraries. Second of all, why in the world would translating your input space from $[0,1] \to [-1,1]$ have any effect, especially since you added bias vectors?
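(For concreteness, the $L_2$/MSE loss being discussed would look something like the sketch below, written against the y and y_input names used in the code further down; loss_l2 is a name introduced here purely for illustration:)

# MSE / L2-style loss on the sigmoid output (illustrative sketch)
loss_l2 = tf.reduce_mean(tf.square(y_input - y))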

EDIT: This is a solution using cross entropy and one-hot encoding, compiled from multiple sources. EDIT^2: Changed the code to use cross-entropy without any extra encoding or any weird target value shifting.
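For reference, the per-example binary cross-entropy that the code below implements is $-[y \log \hat y + (1 - y)\log(1 - \hat y)]$, where $y$ is the target fed into y_input and $\hat y$ is the sigmoid output y.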

import math
import tensorflow as tf
import numpy as np

HIDDEN_NODES = 10

x = tf.placeholder(tf.float32, [None, 2])
W_hidden = tf.Variable(tf.truncated_normal([2, HIDDEN_NODES]))
b_hidden = tf.Variable(tf.zeros([HIDDEN_NODES]))
hidden = tf.nn.relu(tf.matmul(x, W_hidden) + b_hidden)

W_logits = tf.Variable(tf.truncated_normal([HIDDEN_NODES, 1]))
b_logits = tf.Variable(tf.zeros([1]))
logits = tf.add(tf.matmul(hidden, W_logits), b_logits)


y = tf.nn.sigmoid(logits)


y_input = tf.placeholder(tf.float32, [None, 1])



# element-wise binary cross-entropy; minimize() implicitly sums it, though wrapping it
# in tf.reduce_mean(...) would give the more conventional scalar loss
loss = -(y_input * tf.log(y) + (1 - y_input) * tf.log(1 - y))

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

init_op = tf.initialize_all_variables()

sess = tf.Session()
sess.run(init_op)

xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])


yTrain = np.array([[0], [1], [1], [0]])


for i in xrange(2000):
  _, loss_val,logitsval = sess.run([train_op, loss,logits], feed_dict={x: xTrain, y_input: yTrain})

  if i % 10 == 0:
    print "Step:", i, "Current loss:", loss_val,"logits",logitsval

print "---------"
print sess.run(y,feed_dict={x: xTrain})
Underwing answered 30/11, 2015 at 22:19 Comment(10)
Using cross entropy to solve XOR as a classification problem is certainly possible (and I answered a previous question about that: #33748096). The question was posed as a regression problem, for which MSE is more appropriate. I'm not exactly sure why rescaling the input data is necessary, but perhaps it is getting stuck in a local minimum?Yoshi
Well, maybe, but does the XOR error surface include local minima? Or is there only one local minimum, i.e. the global minimum?Underwing
Also: why does this not work without one-hot?! If you make the targets 1-dimensional and change the corresponding weight matrices, it doesn't work -- it blows up to NaNs. I am not sure about this whole TensorFlow thing; it seems like Theano might be more suited towards NNs.Underwing
I just edited my answer to point out that the rescaling isn't strictly necessary (it just takes much longer to converge at the given learning rate). Not sure what you mean about needing one-hot though: my answer doesn't use it. Perhaps you could pose your problem as a separate question?Yoshi
Sure, why does yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]]) need to be two dimensional when the desired output is 1-d [0,1,1,0]?Underwing
#34011435Underwing
This solution looks interesting and to me it makes more sense because of the unaltered training data. However, could you explain what logits exactly is and what softmax_cross_entropy_with_logits(logits, y_input) does? I would've expected that we want to somehow model the difference between the output y and the expected/real output?Skull
We do -- y_input is just a poorly named variable :) it's the expected outputUnderwing
Ah, OK. Thanks for the clarification. However, could you add an explanation of why logits / softmax_cross_entropy_with_logits is suitable for our problem?Skull
@ascenator just updated the code to use cross-entropy with sigmoid activation based on the actual definition of cross-entropy, so hopefully that's clear!Underwing
