I'm implementing a neural network, and I want to use ReLU as the activation function of the neurons. Furthermore, I'm training the network with SGD and back-propagation. I'm testing the neural network with the paradigmatic XOR problem, and so far it classifies new samples correctly if I use the logistic function or the hyperbolic tangent as activation functions.
I've been reading about the benefits of using the Leaky ReLU as an activation function, and I implemented it in Python like this:
def relu(data, epsilon=0.1):
    return np.maximum(epsilon * data, data)
where np is NumPy. The associated derivative is implemented like this:
def relu_prime(data, epsilon=0.1):
    if 1. * np.all(epsilon < data):
        return 1
    return epsilon
Using this function as the activation, I get incorrect results. For example:
Input = [0, 0] --> Output = [0.43951457]
Input = [0, 1] --> Output = [0.46252925]
Input = [1, 0] --> Output = [0.34939594]
Input = [1, 1] --> Output = [0.37241062]
It can be seen that the outputs differ greatly from the expected XOR ones. So the question is: is there any special consideration when using ReLU as the activation function?
Please, don't hesitate to ask me for more context or code. Thanks in advance.
EDIT: there is a bug in the derivative, as it only returns a single float value, and not a NumPy array. The correct code should be:
def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > epsilon)
    gradients[gradients == 0] = epsilon
    return gradients
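To make the difference visible, here is a quick check with some made-up input values (the array below is only an illustration, not data from the actual network): the original relu_prime collapses the whole input to a single scalar, while the corrected version returns one gradient per element.

import numpy as np

# Illustrative values only; not taken from the actual network.
data = np.array([-1.0, 0.05, 2.0])

# Original version: np.all() collapses the comparison to a single boolean,
# so one float is returned for the whole array.
def relu_prime_buggy(data, epsilon=0.1):
    if 1. * np.all(epsilon < data):
        return 1
    return epsilon

# Corrected version from the edit: the elementwise comparison keeps the array shape.
def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > epsilon)
    gradients[gradients == 0] = epsilon
    return gradients

print(relu_prime_buggy(data))  # 0.1 -> a single scalar for every element
print(relu_prime(data))        # [0.1 0.1 1. ] -> one gradient per element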
Comments:

Does gradients = 1. * (data > epsilon) make sense? What's your definition of a leaky ReLU function? This would set the gradient equal to epsilon for some data values that are greater than zero. – Consignor

The epsilon values come from the data values smaller than epsilon, while the 1's come from all the remaining values greater than epsilon. In this case, I'm using epsilon = 0.1. – Dutybound

Looking at f(x) in that Wikipedia section on Leaky ReLUs, I see a piecewise derivative of 1 when x > 0 and alpha otherwise. I could be missing something, though. – Consignor

I've corrected relu_prime in the edit above. I've already used the 0.01 value for epsilon; I saw in other posts that the value of epsilon can be variable, as long as it is "small". – Dutybound

gradients == 0 will be True for values of x greater than 0 but less than epsilon, though, making the derivative epsilon for x values greater than 0 but less than epsilon. Does that follow from the f(x) definition? – Consignor
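For reference, a derivative that matches the piecewise definition mentioned in the comments (1 when x > 0, epsilon otherwise) would compare the input against 0 rather than against epsilon. A minimal sketch of that variant, not the code used for the results above:

import numpy as np

def relu_prime_piecewise(data, epsilon=0.1):
    # Derivative of the leaky ReLU taken directly from the piecewise definition:
    # 1 where x > 0, epsilon everywhere else (including x = 0).
    gradients = np.ones_like(data, dtype=float)
    gradients[data <= 0] = epsilon
    return gradients

print(relu_prime_piecewise(np.array([-1.0, 0.05, 2.0])))  # [0.1 1.  1. ]

With this variant, inputs between 0 and epsilon get a gradient of 1 instead of epsilon, which is exactly the difference the last comment points out.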