Correct backpropagation in simple perceptron

Given the simple OR gate problem:

or_input = np.array([[0,0], [0,1], [1,0], [1,1]])
or_output = np.array([[0,1,1,1]]).T

If we train a simple single-layered perceptron (without backpropagation), we could do something like this:

import numpy as np
np.random.seed(0)

def sigmoid(x): # Maps any real value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def cost(predicted, truth):
    return (truth - predicted)**2

or_input = np.array([[0,0], [0,1], [1,0], [1,1]])
or_output = np.array([[0,1,1,1]]).T

# Define the shape of the weight vector.
num_data, input_dim = or_input.shape
# Define the shape of the output vector. 
output_dim = len(or_output.T)

num_epochs = 50 # No. of times to iterate.
learning_rate = 0.03 # How large a step to take per iteration.

# Let's standardize and call our inputs X and outputs Y
X = or_input
Y = or_output
W = np.random.random((input_dim, output_dim))

for _ in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(X, W))

    # How much did we miss in the predictions?
    cost_error = cost(layer1, Y)

    # update weights
    W +=  - learning_rate * np.dot(layer0.T, cost_error)

# Expected output.
print(Y.tolist())
# On the training data
print([[int(prediction > 0.5)] for prediction in layer1])

[out]:

[[0], [1], [1], [1]]
[[0], [1], [1], [1]]

With backpropagation, to compute d(cost)/d(X), are the following steps correct?

compute the layer1 error by multiplying the cost error and the derivatives of the cost
then compute the layer1 delta by multiplying the layer 1 error and the derivatives of the sigmoid
then do a dot product between the inputs and the layer1 delta to get the derivative of the cost, i.e. d(cost)/d(X)

Then the d(cost)/d(X) is multiplied with the negative of the learning rate to perform gradient descent.

num_epochs = 0 # No. of times to iterate.
learning_rate = 0.03 # How large a step to take per iteration.

# Let's standardize and call our inputs X and outputs Y
X = or_input
Y = or_output
W = np.random.random((input_dim, output_dim))

for _ in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(X, W))

    # How much did we miss in the predictions?
    cost_error = cost(layer1, Y)

    # Back propagation.
    # multiply how much we missed from the gradient/slope of the cost for our prediction.
    layer1_error = cost_error * cost_derivative(cost_error)

    # multiply how much we missed by the gradient/slope of the sigmoid at the values in layer1
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W +=  - learning_rate * np.dot(layer0.T, layer1_delta)

In that case, should the implementation look like this below with the cost_derivative and sigmoid_derivative?

import numpy as np
np.random.seed(0)

def sigmoid(x): # Maps any real value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

def cost(predicted, truth):
    return (truth - predicted)**2

def cost_derivative(y):
    # If the cost is:
    # cost = y - y_hat
    # What's the derivative of d(cost)/d(y)
    # d(cost)/d(y) = 1
    return 2*y


or_input = np.array([[0,0], [0,1], [1,0], [1,1]])
or_output = np.array([[0,1,1,1]]).T

# Define the shape of the weight vector.
num_data, input_dim = or_input.shape
# Define the shape of the output vector. 
output_dim = len(or_output.T)

num_epochs = 5 # No. of times to iterate.
learning_rate = 0.03 # How large a step to take per iteration.

# Let's standardize and call our inputs X and outputs Y
X = or_input
Y = or_output
W = np.random.random((input_dim, output_dim))

for _ in range(num_epochs):
    layer0 = X
    # Forward propagation.
    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(X, W))

    # How much did we miss in the predictions?
    cost_error = cost(layer1, Y)

    # Back propagation.
    # multiply how much we missed from the gradient/slope of the cost for our prediction.
    layer1_error = cost_error * cost_derivative(cost_error)

    # multiply how much we missed by the gradient/slope of the sigmoid at the values in layer1
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W +=  - learning_rate * np.dot(layer0.T, layer1_delta)

# Expected output.
print(Y.tolist())
# On the training data
print([[int(prediction > 0.5)] for prediction in layer1])

[out]:

[[0], [1], [1], [1]]
[[0], [1], [1], [1]]

BTW, given this random seed, the predictions can already be right with the initial random W, even without any gradient descent or training:

import numpy as np
np.random.seed(0)

# Let's standardize and call our inputs X and outputs Y
X = or_input
Y = or_output
W = np.random.random((input_dim, output_dim))

# On the training data
predictions = sigmoid(np.dot(X, W))
[[int(prediction > 0.5)] for prediction in predictions]

You are almost correct. In your implementation, you define the cost as the square of the error, which has the unfortunate consequence of always being positive. As a result, if you plot mean(cost_error), it rises slowly at each iteration, and your weights slowly decrease.

In your particular case, any weights > 0 happen to work. If you run your implementation for enough epochs, your weights will turn negative and your network won't work anymore.
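The drift described above can be reproduced with a short sketch. It reuses the update rule from the question's first snippet; the epoch count of 2000 is an assumption chosen only to make the effect obvious, and the names `W_start` and `final_predictions` are introduced here for illustration.

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 1, 1, 1]], dtype=float).T

np.random.seed(0)
W = np.random.random((2, 1))
W_start = W.copy()

learning_rate = 0.03
num_epochs = 2000  # assumed value, far more than the 50 used in the question

for _ in range(num_epochs):
    layer1 = sigmoid(X @ W)
    squared_error = (Y - layer1) ** 2  # never negative
    # X and squared_error are both non-negative, so this step can only lower W.
    W += -learning_rate * (X.T @ squared_error)

final_predictions = [int(p > 0.5) for p in sigmoid(X @ W).ravel()]
print("initial weights:", W_start.ravel())
print("final weights:  ", W.ravel())
print("predictions:    ", final_predictions)
```

Because every step subtracts a non-negative quantity, the weights can only move downward, whatever the targets are.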

You can just remove the square in your cost function :

def cost(predicted, truth):
    return (truth - predicted)

Now, to update your weights, you need to evaluate the gradient at the "position" of your error. So what you need is:

d_predicted = output_errors * sigmoid_derivative(predicted_output)
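One way to read this line is as the chain rule applied through the sigmoid, assuming the quantity being minimized is half the squared error (an assumption; the cost is not named here). The symbols z = XW and ŷ = sigmoid(z) are notation introduced only for this sketch:

```latex
\begin{aligned}
C &= \tfrac{1}{2}\sum_i (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \sigma(z_i), \qquad z = XW \\
\frac{\partial C}{\partial \hat{y}_i} &= -(y_i - \hat{y}_i) \\
\frac{\partial \hat{y}_i}{\partial z_i} &= \sigma(z_i)\,\bigl(1 - \sigma(z_i)\bigr) = \hat{y}_i\,(1 - \hat{y}_i) \\
\frac{\partial C}{\partial z_i} &= -(y_i - \hat{y}_i)\,\hat{y}_i\,(1 - \hat{y}_i)
\end{aligned}
```

Under that assumption, d_predicted is the negative of ∂C/∂z, and the sigmoid derivative is evaluated at the predicted output, not at the error.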

Next, we update the weights :

W += np.dot(X.T, d_predicted) * learning_rate
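A finite-difference check is one way to gain confidence in this update. The sketch below assumes the cost is half the summed squared error (not stated above) and compares `X.T @ d_predicted` against a numerically estimated gradient. The helper `half_squared_error` and the step size `eps` are hypothetical choices made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def half_squared_error(W, X, Y):
    # Assumed cost for this check: 0.5 * sum of squared errors.
    return 0.5 * np.sum((Y - sigmoid(X @ W)) ** 2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 1, 1, 1]], dtype=float).T

np.random.seed(0)
W = np.random.random((2, 1))

# Analytic step direction used by the update rule.
predicted = sigmoid(X @ W)
d_predicted = (Y - predicted) * predicted * (1 - predicted)
analytic_step = X.T @ d_predicted  # should equal -dC/dW under the assumed cost

# Central finite differences for dC/dW, one weight at a time.
eps = 1e-6
numeric_grad = np.zeros_like(W)
for i in range(W.shape[0]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, 0] += eps
    W_minus[i, 0] -= eps
    numeric_grad[i, 0] = (
        half_squared_error(W_plus, X, Y) - half_squared_error(W_minus, X, Y)
    ) / (2 * eps)

# The two should cancel: the analytic step is the negative gradient.
max_diff = np.max(np.abs(analytic_step + numeric_grad))
print("largest mismatch:", max_diff)
```

A mismatch close to zero suggests that adding `learning_rate * X.T @ d_predicted` moves the weights downhill on that cost.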

Full code with error display :

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)

def sigmoid(x): # Maps any real value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

def cost(predicted, truth):
    return (truth - predicted)

or_input = np.array([[0,0], [0,1], [1,0], [1,1]])
or_output = np.array([[0,1,1,1]]).T

# Define the shape of the weight vector.
num_data, input_dim = or_input.shape
# Define the shape of the output vector. 
output_dim = len(or_output.T)

num_epochs = 50 # No. of times to iterate.
learning_rate = 0.1 # How large a step to take per iteration.

# Let's standardize and call our inputs X and outputs Y
X = or_input
Y = or_output
W = np.random.random((input_dim, output_dim))

# W = [[-1],[1]] # you can try to set bad weights to see the training process
error_list = []

for _ in range(num_epochs):
    layer0 = X
    # Forward propagation.
    layer1 = sigmoid(np.dot(X, W))

    # How much did we miss in the predictions?
    cost_error = cost(layer1, Y)
    error_list.append(np.mean(cost_error)) # save the loss to plot later

    # Back propagation.
    # eval the gradient: error times the sigmoid slope at the predicted output
    d_predicted = cost_error * sigmoid_derivative(layer1)

    # update weights
    W = W + np.dot(X.T, d_predicted) * learning_rate


# Expected output.
print(Y.tolist())
# On the training data
print([[int(prediction > 0.5)] for prediction in layer1])

# plot error curve (error_list holds the mean error saved at each epoch)
plt.plot(range(num_epochs), error_list, '+b')
plt.xlabel('Epoch')
plt.ylabel('mean error')
plt.show()

I also added a line to set the initial weights manually, so you can see how the network learns.
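As a rough sketch of that experiment, the loop below starts from the deliberately bad weights [[-1], [1]] and applies the same update rule. The epoch count of 1000 and the `history` list of mean absolute errors are assumptions added for illustration; plotting is left out so the snippet runs without matplotlib.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # Expects the sigmoid output, not the raw input.
    return sx * (1 - sx)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 1, 1, 1]], dtype=float).T

W = np.array([[-1.0], [1.0]])  # bad start: the first weight has the wrong sign
learning_rate = 0.1
num_epochs = 1000  # assumed value, larger than the 50 used above

history = []
for _ in range(num_epochs):
    layer1 = sigmoid(X @ W)
    error = Y - layer1
    history.append(float(np.mean(np.abs(error))))
    # Same update as above: error times sigmoid slope at the prediction.
    W = W + learning_rate * (X.T @ (error * sigmoid_derivative(layer1)))

predictions = [int(p > 0.5) for p in sigmoid(X @ W).ravel()]
print("final weights:", W.ravel())
print("predictions:  ", predictions)
print("mean |error| first vs last epoch:", history[0], history[-1])
```

Note that without a bias term the [0, 0] input always produces exactly 0.5, which the `> 0.5` threshold maps to 0.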
