Why does almost every Activation Function Saturate at Negative Input Values in a Neural Network

This may be a very basic/trivial question.

For negative inputs:

  1. the output of the ReLU activation function is zero
  2. the output of the sigmoid activation function is zero
  3. the output of the tanh activation function is -1

My questions are:

  1. Why do all of the above activation functions saturate for negative input values?
  2. Is there an activation function we can use if we want to predict a negative target value?

Thank you.

Spiccato answered 27/2, 2020 at 15:34 Comment(3)
Aren't tanh and sigmoid symmetrical about 0?Shirleeshirleen
Only Tanh is symmetrical to 0. Sigmoid is symmetrical to 0.5. (But regarding the "inputs", they're symmetrical to 0)Mythical
I would like to know what exactly you mean by the second part of the question i.e. predicting a negative target valueLather
  1. True - ReLU is designed to output zero for negative values. (It can be dangerous with big learning rates, bad initialization, or very few units: all the neurons can get stuck at zero and the model freezes.)

  2. False - Sigmoid only approaches zero for "very negative" inputs, not for all "negative" inputs. If your inputs are between -3 and +3, you will see outputs that vary nicely between 0 and 1.

  3. False - the same comment as for sigmoid applies. If your inputs are between -2 and 2, you will see outputs that vary nicely between -1 and 1.


So, the saturation problem only exists for inputs whose absolute values are too big.
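A quick numeric check of this (a plain NumPy sketch, not part of the original answer) shows that sigmoid and tanh only get pinned to 0 and -1 once the inputs become strongly negative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-1.0, -3.0, -10.0]:
    print(x, sigmoid(x), np.tanh(x))
# -1.0   0.26894    -0.76159    -> nowhere near saturation
# -3.0   0.04743    -0.99505
# -10.0  0.0000454  -0.99999999 -> effectively saturated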

By definition, the outputs are:

  • ReLU: 0 <= y < inf (centered at 0)
  • Sigmoid: 0 < y < 1 (centered at 0.5)
  • TanH: -1 < y < 1 (centered at 0)

You might want to use a BatchNormalization layer before these activations to keep the pre-activation values from getting too big and to avoid saturation.
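As an illustration, here is a minimal sketch (assuming TensorFlow's Keras; the layer sizes and input shape are made up for the example) where the normalization sits between the linear layer and the activation:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10,)),        # made-up input dimension
    layers.Dense(64),                # linear pre-activation, no activation yet
    layers.BatchNormalization(),     # keeps the pre-activations in a small range
    layers.Activation("tanh"),       # tanh now sees values close to 0
    layers.Dense(1),                 # linear output
])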


For predicting negative outputs, tanh is the only one of the three that is capable of doing that.

You could invent a negative sigmoid, though, it's pretty easy:

from tensorflow import keras

def neg_sigmoid(x):
    # flipped sigmoid: output range is (-1, 0)
    return -keras.backend.sigmoid(x)

# use it in a layer:
keras.layers.Activation(neg_sigmoid)
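For example (a hypothetical usage, relying on the import above), the custom activation can also be passed straight to a layer:

# reuses neg_sigmoid from the snippet above
output_layer = keras.layers.Dense(1, activation=neg_sigmoid)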
Marshallmarshallese answered 27/2, 2020 at 16:46 Comment(1)
Thank you for the mention of BatchNormalization and for Negative Output using Sigmoid.Spiccato

In short, negative/positive doesn't matter for these activation functions.

  1. Sigmoid and tanh both saturate for large positive and large negative values. As stated in the comments, they are symmetrical about an input of 0. ReLU does only saturate for negative values, but I'll explain why that doesn't matter in the next question.

  2. The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to give an equation that predicts your final value, but to give your neural network a non-linearity in the middle layers. You then use some appropriate function at the last layer to get the wanted output values, e.g. softmax for classification or just a linear layer for regression (a minimal sketch follows below).
    So because these activation functions sit in the middle, it really doesn't matter if an activation function only outputs positive values even if your 'wanted' values are negative, since the model will make the weights of the next layer negative (hence the phrase 'wanted values are negative' doesn't mean anything for hidden layers).

So, ReLU being saturated on the negative side is no different from it being saturated on the positive side. There are activation functions that don't saturate, such as leaky ReLU, so you may want to check that out. But the point is that positive/negative doesn't matter for activation functions.
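To make the second point concrete, here is a minimal sketch (assuming TensorFlow's Keras and NumPy; the layer sizes, data, and training settings are invented for illustration) of a network with ReLU hidden layers whose linear output layer still fits purely negative targets:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(32, activation="relu"),   # hidden non-linearity, outputs >= 0
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                       # linear output: any real value, including negative
])
model.compile(optimizer="adam", loss="mse")

# all targets are negative; the model can still fit them
x = np.random.randn(256, 4)
y = -np.abs(np.random.randn(256, 1))
model.fit(x, y, epochs=5, verbose=0)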

Shirleeshirleen answered 27/2, 2020 at 17:32 Comment(2)
Thank you, your second point has a very good explanation. However, you said ReLU saturates on the positive side as well, which is not the case.Spiccato
@Spiccato Umm... I don't think I said that.Shirleeshirleen
  1. The key idea behind introducing the ReLU activation function was to address the issue of vanishing gradients in deeper networks, and also to introduce sparsity into the network. Put simply, ReLU prunes the connections it deems unimportant (those with negative pre-activations) by outputting zero for them. However, depending on the initialization, when the weights grow above 1 the gradients can explode, so we have to be careful about the distribution of the weights we initialize, or the network can end up too sparse and unable to learn any more information.

  2. Sigmoid - the key problem with sigmoid for gradient-based learning rules is its derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) peaks at 0.25 and goes to 0 for inputs of large magnitude, whether positive or negative. This causes vanishing gradients, so the problem is not specific to negative values; very large positive inputs are just as bad (see the small numeric check after this list).

  3. Tanh - the idea behind tanh is not to enforce the sparsity that ReLU enforces, but, similar to sigmoid, to use the whole network's capacity for learning. Tanh is zero-centred and has stronger gradients around 0 than sigmoid, which mitigates (though does not remove) the vanishing gradient problem. Having negative outputs in the network also acts as a kind of dynamic regularizer (strongly negative pre-activations are mapped close to -1 while values near 0 stay near 0), which can be useful for binary or few-class classification problems.
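As referenced in (2), here is a small numeric check (plain NumPy, not from the original answer) of how the sigmoid derivative vanishes on both sides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

for x in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"x = {x:6.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")
# peaks at 0.25 for x = 0 and is about 0.000045 at |x| = 10: a vanishing gradient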

This link has some good information that would be helpful for you.

Lather answered 27/2, 2020 at 17:11 Comment(1)
Thank you for the link, it is very usefulSpiccato

Not necessarily: it is possible to modify the ReLU activation function so that it passes some information for negative inputs instead of zeroing them out. One way to achieve this is by using a variant of the ReLU function called the leaky ReLU.

In the leaky ReLU, instead of setting negative values to zero, we give the negative side a small positive slope, typically a constant like 0.01 or 0.001, so negative inputs produce small negative outputs. This allows some information to flow through neurons whose pre-activations are negative, which can help the learning process in some cases.

The mathematical expression for the leaky ReLU is:

f(x) = max(ax, x)

where a is a small positive constant, and x is the input to the neuron. When x is negative, the slope of the function is a, and when x is positive, the function behaves like the regular ReLU.

Another variant of the ReLU function that keeps some signal for negative inputs is the exponential linear unit (ELU). The ELU is defined as:

f(x) = { x, if x >= 0; alpha * (exp(x) - 1), if x < 0 }

where alpha is a positive constant and exp() is the exponential function. The ELU takes negative values for negative inputs and saturates at -alpha for very negative inputs. For positive inputs it is simply the identity, just like the regular ReLU.
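Here is a minimal NumPy sketch of both variants, using the corrected definitions above (the constants a = 0.01 and alpha = 1.0 are common defaults, chosen here for illustration):

import numpy as np

def leaky_relu(x, a=0.01):
    # slope a on the negative side, identity on the positive side
    return np.where(x >= 0, x, a * x)

def elu(x, alpha=1.0):
    # identity for x >= 0, alpha * (exp(x) - 1) for x < 0
    return np.where(x >= 0, x, alpha * np.expm1(x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))   # [-0.1   -0.01   0.    1.   10.  ]
print(elu(x))          # [-0.99995 -0.63212  0.  1.  10. ]  -> saturates near -alpha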

Tan answered 16/3, 2023 at 5:21 Comment(0)
