Considerations for using ReLU as activation function
Asked Answered

I'm implementing a neural network and want to use ReLU as the activation function of the neurons. Furthermore, I'm training the network with SGD and back-propagation. I'm testing the network on the paradigmatic XOR problem, and so far it classifies new samples correctly if I use the logistic function or the hyperbolic tangent as the activation function.

I've been reading about the benefits of using Leaky ReLU as the activation function, and implemented it in Python like this:

def relu(data, epsilon=0.1):
    # Leaky ReLU: pass positive values through unchanged, scale the rest by epsilon
    return np.maximum(epsilon * data, data)

where np is the name for NumPy. The associated derivative is implemented like this:

def relu_prime(data, epsilon=0.1):
    if 1. * np.all(epsilon < data):
        return 1
    return epsilon

Using this function as activation I get incorrect results. For example:

  • Input = [0, 0] --> Output = [0.43951457]

  • Input = [0, 1] --> Output = [0.46252925]

  • Input = [1, 0] --> Output = [0.34939594]

  • Input = [1, 1] --> Output = [0.37241062]

As can be seen, the outputs differ greatly from the expected XOR values. So the question is: is there any special consideration when using ReLU as the activation function?

Please don't hesitate to ask me for more context or code. Thanks in advance.

EDIT: there is a bug in the derivative, as it only returns a single float value, and not a NumPy array. The correct code should be:

def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > epsilon)
    gradients[gradients == 0] = epsilon
    return gradients
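
With this version relu_prime returns an array of gradients. As a quick check (plain NumPy, with the default epsilon=0.1; the sample values are arbitrary):

import numpy as np

relu_prime(np.array([-1.0, 0.05, 2.0]))
# -> array([0.1, 0.1, 1. ])

Note that 0.05 still gets a gradient of epsilon even though it is positive, because the comparison is against epsilon rather than 0; this is what the comments and the answer below address.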
Dutybound answered 8/1, 2017 at 23:27 Comment(8)
did it work after modifying the gradient calculation part?Inclined
@KrishnaKishoreAndhavarapu After modifying it I get correct results, but like 5 out of 10 times. I believe that I should get correct results every time. There is clearly something I'm missing with this activation function.Dutybound
Are you sure gradients = 1. * (data > epsilon) makes sense? What's your definition of a leaky ReLU function? This would set the gradient equal to epsilon for some data values that are greater than zero.Consignor
@NickBecker My definition of Leaky ReLU is the one from Wikipedia (en.wikipedia.org/wiki/Rectifier_(neural_networks)#Leaky_ReLUs). That line returns an array of 0's and 1's. The 0's come from all the values of data that are smaller than epsilon, while the 1's come from the remaining values greater than epsilon. In this case, I'm using epsilon = 0.1.Dutybound
When I look at the piecewise function f(x) in that Wikipedia section on Leaky ReLUs, I see a piecewise derivative of 1 when x > 0 and alpha otherwise. I could be missing something, though.Consignor
@NickBecker That piecewise behaviour is what I generate in the 2nd line of relu_prime. I've already tried the value 0.01 for epsilon. I saw in other posts that the value of epsilon can vary, as long as it is "small".Dutybound
gradients == 0 will be True for values of x greater than 0 but less than epsilon, though, making the derivative epsilon for x values greater than 0 but less than epsilon. Does that follow from the f(x) definition?Consignor
Let us continue this discussion in chat.Consignor

Your relu_prime function should be:

def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > 0)          # 1.0 where data > 0, else 0.0
    gradients[gradients == 0] = epsilon  # leaky slope for non-positive inputs
    return gradients

Note the comparison of each value in the data matrix to 0, instead of epsilon. This follows from the standard definition of leaky ReLUs, which creates a piecewise gradient of 1 when x > 0 and epsilon otherwise.
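
As a quick sanity check (plain NumPy; the sample values are arbitrary), positive entries now get a gradient of 1 and everything else gets epsilon:

import numpy as np

relu_prime(np.array([-1.0, 0.05, 2.0]))
# -> array([0.1, 1. , 1. ])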

I can't comment on whether leaky ReLUs are the best choice for the XOR problem, but this should resolve your gradient issue.

Consignor answered 9/1, 2017 at 14:16 Comment(1)
Now I get correct results most of the time. Between @ArnisShaykh's answer and yours, I've learned that the choice of activation function depends on the data values.Dutybound

Short answer

Don't use ReLU with binary inputs. It is designed to operate on much larger values. Also avoid using it when there are no negative values, because that basically means you are using a linear activation function, which is not the best choice. It is best used with convolutional neural networks.

Long answer

I can't say if there is anything wrong with the Python code, because I code in Java. But logic-wise, I think that using ReLU in this case is a bad decision. As we are predicting XOR, there is a limited range to the values of your NN: [0, 1]. This is also the range of the sigmoid activation function. With ReLU you operate with values in [0, infinity), which means there is an awful lot of values that you are never going to use, since it is XOR. But the ReLU will still take these values into consideration, and the error you get will increase. That is why you get correct answers only about 50% of the time; in fact, this value can be as low as 0% and as high as 99%. Moral of the story: when deciding which activation function to use, try to match the range of the values in your NN with the range of the activation function's values.
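
To make the range point concrete, here is a minimal sketch in NumPy (not code from the question; the pre-activation values are arbitrary) comparing the output ranges of the three activations:

import numpy as np

z = np.linspace(-5, 5, 11)                 # arbitrary pre-activation values

sigmoid = 1 / (1 + np.exp(-z))             # squashed into (0, 1), matching the XOR targets
tanh = np.tanh(z)                          # squashed into (-1, 1)
leaky_relu = np.maximum(0.1 * z, z)        # unbounded above, so outputs can drift far from {0, 1}

print(sigmoid.min(), sigmoid.max())        # ~0.0067, ~0.9933
print(tanh.min(), tanh.max())              # ~-0.9999, ~0.9999
print(leaky_relu.min(), leaky_relu.max())  # -0.5, 5.0

The bounded activations land naturally near the {0, 1} targets, while the leaky ReLU output is unbounded, which is the mismatch described above.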

Coelostat answered 9/1, 2017 at 13:46 Comment(2)
Thanks for pointing that out. I hadn't thought about it. Makes total sense.Dutybound
Glad that was helpful.Coelostat
