TensorFlow MLP not training XOR

I've built an MLP with Google's TensorFlow library. The network runs, but somehow it refuses to learn properly: it always converges to an output of nearly 1.0, no matter what the input actually is.

The complete code can be seen here.

Any ideas?


The input and output (batch size 4) are as follows:

input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]  # XOR input
output_data = [[0.], [1.], [1.], [0.]]  # XOR output

n_input = tf.placeholder(tf.float32, shape=[None, 2], name="n_input")
n_output = tf.placeholder(tf.float32, shape=[None, 1], name="n_output")

Hidden layer configuration:

# hidden layer's bias neuron
b_hidden = tf.Variable(0.1, name="hidden_bias")

# hidden layer's weight matrix initialized with a uniform distribution
W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0), name="hidden_weights")

# calc hidden layer's activation
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)

Output layer configuration:

W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0), name="output_weights")  # output layer's weight matrix
output = tf.sigmoid(tf.matmul(hidden, W_output))  # calc output layer's activation

My learning setup looks like this:

loss = tf.reduce_mean(cross_entropy)  # mean the cross_entropy
optimizer = tf.train.GradientDescentOptimizer(0.01)  # take a gradient descent for optimizing
train = optimizer.minimize(loss)  # let the optimizer train

I tried both setups for cross entropy:

cross_entropy = -tf.reduce_sum(n_output * tf.log(output))

and

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(n_output, output)

where n_output is the target output as given in output_data, and output is the value predicted/calculated by my network.
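For reference, here is a minimal sketch of the two-term binary cross-entropy (the form my solution below ends up using), reusing the placeholders defined above. The first formulation only penalizes the examples whose target is 1, so nothing discourages the network from outputting ~1 for every input. The logits name is introduced here purely for illustration, and I believe sigmoid_cross_entropy_with_logits expects the pre-sigmoid logits rather than the sigmoid output:

# full binary cross-entropy: also penalizes the examples whose target is 0
cross_entropy = -tf.reduce_mean(n_output * tf.log(output)
                                + (1. - n_output) * tf.log(1. - output))

# or, using the built-in op on the *pre-sigmoid* logits (illustrative sketch)
logits = tf.matmul(hidden, W_output)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits, n_output)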


The training inside the for-loop (for n epochs) goes like this:

cvalues = sess.run([train, loss, W_hidden, b_hidden, W_output],
                   feed_dict={n_input: input_data, n_output: output_data})

I am saving the outcome to cvalues for debug printing of loss, W_hidden, ...


No matter what I've tried, when I test my network and try to validate the output, it always produces something like this:

(...)
step: 2000
loss: 0.0137040186673
b_hidden: 1.3272010088
W_hidden: [[ 0.23195425  0.53248233 -0.21644847 -0.54775208  0.52298909]
 [ 0.73933059  0.51440752 -0.08397482 -0.62724304 -0.53347367]]
W_output: [[ 1.65939867]
 [ 0.78912479]
 [ 1.4831928 ]
 [ 1.28612828]
 [ 1.12486529]]

(--- finished with 2000 epochs ---)

(Test input for validation:)

input: [0.0, 0.0] | output: [[ 0.99339396]]
input: [0.0, 1.0] | output: [[ 0.99289012]]
input: [1.0, 0.0] | output: [[ 0.99346077]]
input: [1.0, 1.0] | output: [[ 0.99261558]]

So it is not learning properly; it always converges to nearly 1.0 no matter which input is fed.

Skull answered 30/11, 2015 at 11:42 Comment(2)
Your b_hidden variable is a scalar - is that intentional? I think you should create it as b_hidden = tf.Variable(tf.constant(0.1, shape=[hidden_nodes]), name="hidden_bias"), which might help. Another thing to try would be adding a b_output bias term to your output layer.Yoshi
Thank you for the comment. Indeed, I just failed to notice that b_hidden should also be a vector and not a scalar... However, the network still converges to nearly 1.0 for every input, with or without a hidden bias, as a scalar or a vector, and with or without a bias for the output layer. I really think I am missing some error in the learning method or network architecture :/Skull
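(For concreteness, the biases suggested in the comments above would look something like the following sketch, reusing the names from the question; b_output is a hypothetical name introduced here purely for illustration, not part of the original code:)

# vector bias for the hidden layer, as suggested above
b_hidden = tf.Variable(tf.constant(0.1, shape=[hidden_nodes]), name="hidden_bias")
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)

# additional bias for the output layer (b_output is hypothetical)
b_output = tf.Variable(tf.constant(0.1, shape=[1]), name="output_bias")
output = tf.sigmoid(tf.matmul(hidden, W_output) + b_output)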

In the meantime, with the help of a colleague, I was able to fix my solution and want to post it for completeness. My solution works with cross entropy and without altering the training data. Additionally, it has the desired input shape of (1, 2), and the output is a scalar.

It makes use of an AdamOptimizer, which decreases the error much faster than a GradientDescentOptimizer. See this post for more information (& questions^^) about the optimizer.

In fact, my network produces reasonably good results in only 400-800 learning steps.

After 2000 learning steps the output is nearly "perfect":

step: 2000
loss: 0.00103311243281

input: [0.0, 0.0] | output: [[ 0.00019799]]
input: [0.0, 1.0] | output: [[ 0.99979786]]
input: [1.0, 0.0] | output: [[ 0.99996307]]
input: [1.0, 1.0] | output: [[ 0.00033751]]

import tensorflow as tf    

#####################
# preparation stuff #
#####################

# define input and output data
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]  # XOR input
output_data = [[0.], [1.], [1.], [0.]]  # XOR output

# create a placeholder for the input
# None indicates a variable batch size for the input
# one input's dimension is [1, 2] and output's [1, 1]
n_input = tf.placeholder(tf.float32, shape=[None, 2], name="n_input")
n_output = tf.placeholder(tf.float32, shape=[None, 1], name="n_output")

# number of neurons in the hidden layer
hidden_nodes = 5


################
# hidden layer #
################

# hidden layer's bias vector
b_hidden = tf.Variable(tf.random_normal([hidden_nodes]), name="hidden_bias")

# hidden layer's weight matrix initialized with a normal distribution
W_hidden = tf.Variable(tf.random_normal([2, hidden_nodes]), name="hidden_weights")

# calc hidden layer's activation
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)


################
# output layer #
################

W_output = tf.Variable(tf.random_normal([hidden_nodes, 1]), name="output_weights")  # output layer's weight matrix
output = tf.sigmoid(tf.matmul(hidden, W_output))  # calc output layer's activation


############
# learning #
############
cross_entropy = -(n_output * tf.log(output) + (1 - n_output) * tf.log(1 - output))
# cross_entropy = tf.square(n_output - output)  # simpler, but also works

loss = tf.reduce_mean(cross_entropy)  # mean the cross_entropy
optimizer = tf.train.AdamOptimizer(0.01)  # Adam optimizer with a learning rate of 0.01
train = optimizer.minimize(loss)  # let the optimizer train


####################
# initialize graph #
####################
init = tf.initialize_all_variables()

sess = tf.Session()  # create the session and therefore the graph
sess.run(init)  # initialize all variables  

#####################
# train the network #
#####################
for epoch in xrange(0, 2001):
    # run the training operation
    cvalues = sess.run([train, loss, W_hidden, b_hidden, W_output],
                       feed_dict={n_input: input_data, n_output: output_data})

    # print some debug stuff
    if epoch % 200 == 0:
        print("")
        print("step: {:>3}".format(epoch))
        print("loss: {}".format(cvalues[1]))
        # print("b_hidden: {}".format(cvalues[3]))
        # print("W_hidden: {}".format(cvalues[2]))
        # print("W_output: {}".format(cvalues[4]))


print("")
print("input: {} | output: {}".format(input_data[0], sess.run(output, feed_dict={n_input: [input_data[0]]})))
print("input: {} | output: {}".format(input_data[1], sess.run(output, feed_dict={n_input: [input_data[1]]})))
print("input: {} | output: {}".format(input_data[2], sess.run(output, feed_dict={n_input: [input_data[2]]})))
print("input: {} | output: {}".format(input_data[3], sess.run(output, feed_dict={n_input: [input_data[3]]})))
Skull answered 1/12, 2015 at 14:28 Comment(0)

I can't comment because I don't have enough reputation, but I have some questions on that answer, mrry. The $L_2$ loss function makes sense because it is basically the MSE function, but why wouldn't cross-entropy work? It certainly works for other NN libraries. Second of all, why in the world would translating your input space from $[0,1] \to [-1,1]$ have any effect, especially since you added bias vectors?
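(For concreteness, the $L_2$/MSE loss being discussed would look something like the sketch below, written against the y and y_input names used in the code further down; loss_l2 is a name introduced here purely for illustration:)

# MSE / L2-style loss on the sigmoid output (illustrative sketch)
loss_l2 = tf.reduce_mean(tf.square(y_input - y))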

EDIT: This is a solution using cross entropy and one-hot encoding, compiled from multiple sources. EDIT^2: Changed the code to use cross-entropy without any extra encoding or any weird target value shifting.
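For reference, the per-example binary cross-entropy that the code below implements is $-[y \log \hat y + (1 - y)\log(1 - \hat y)]$, where $y$ is the target fed into y_input and $\hat y$ is the sigmoid output y.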

import math
import tensorflow as tf
import numpy as np

HIDDEN_NODES = 10

x = tf.placeholder(tf.float32, [None, 2])
W_hidden = tf.Variable(tf.truncated_normal([2, HIDDEN_NODES]))
b_hidden = tf.Variable(tf.zeros([HIDDEN_NODES]))
hidden = tf.nn.relu(tf.matmul(x, W_hidden) + b_hidden)

W_logits = tf.Variable(tf.truncated_normal([HIDDEN_NODES, 1]))
b_logits = tf.Variable(tf.zeros([1]))
logits = tf.add(tf.matmul(hidden, W_logits), b_logits)


y = tf.nn.sigmoid(logits)


y_input = tf.placeholder(tf.float32, [None, 1])



# element-wise binary cross-entropy; minimize() implicitly sums it, though wrapping it
# in tf.reduce_mean(...) would give the more conventional scalar loss
loss = -(y_input * tf.log(y) + (1 - y_input) * tf.log(1 - y))

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

init_op = tf.initialize_all_variables()

sess = tf.Session()
sess.run(init_op)

xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])


yTrain = np.array([[0], [1], [1], [0]])


for i in xrange(2000):
  _, loss_val,logitsval = sess.run([train_op, loss,logits], feed_dict={x: xTrain, y_input: yTrain})

  if i % 10 == 0:
    print "Step:", i, "Current loss:", loss_val,"logits",logitsval

print "---------"
print sess.run(y,feed_dict={x: xTrain})
Underwing answered 30/11, 2015 at 22:19 Comment(10)
Using cross entropy to solve XOR as a classification problem is certainly possible (and I answered a previous question about that: #33748096). The question was posed as a regression problem, for which MSE is more appropriate. I'm not exactly sure why rescaling the input data is necessary, but perhaps it is getting stuck in a local minimum?Yoshi
Well, maybe, but does the XOR error surface include local minima? Or is there only one local minimum, i.e. the global minimum?Underwing
Also: why does this not work without one-hot?! If you make the targets 1-dimensional and change the corresponding weight matrices, it doesn't work -- it blows up to NaNs. I am not sure about this whole TensorFlow thing; it seems like Theano might be more suited towards NNs.Underwing
I just edited my answer to point out that the rescaling isn't strictly necessary (it just takes much longer to converge at the given learning rate). Not sure what you mean about needing one-hot though: my answer doesn't use it. Perhaps you could pose your problem as a separate question?Yoshi
Sure, why does yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]]) need to be two dimensional when the desired output is 1-d [0,1,1,0]?Underwing
#34011435Underwing
This solution looks interesting and to me it makes more sense because of the unaltered training data. However, could you explain what logits exactly is and what softmax_cross_entropy_with_logits(logits, y_input) does? I would've expected that we want to somehow model the difference between the output y and the expected/real output?Skull
We do -- y_input is just a poorly named variable :) it's the expected outputUnderwing
Ah, OK. Thanks for the clarification. However, could you add an explanation of why logits / softmax_cross_entropy_with_logits is suitable for our problem?Skull
@ascenator just updated the code to use cross-entropy with sigmoid activation based on the actual definition of cross-entropy, so hopefully that's clear!Underwing
