Do convolutional neural networks suffer from the vanishing gradient?

I think I read somewhere that convolutional neural networks do not suffer from the vanishing gradient problem as much as standard sigmoid neural networks as the number of layers increases, but I have not been able to find a 'why'.

Do they truly not suffer from the problem, or am I wrong and it depends on the activation function? [I have been using Rectified Linear Units, so I have never tested Sigmoid Units for convolutional neural networks.]

Weise answered 9/3, 2015 at 23:30 Comment(0)

Convolutional neural networks (like standard sigmoid neural networks) do suffer from the vanishing gradient problem. The most recommended approaches to overcome the vanishing gradient problem are:

  • Layerwise pre-training
  • Choice of the activation function

You may notice that state-of-the-art deep neural networks for computer vision problems (like the ImageNet winners) use convolutional layers as the first few layers of their network, but that is not the key to solving the vanishing gradient; the key is usually training the network greedily, layer by layer. Using convolutional layers has several other important benefits, of course. Especially in vision problems, when the input size is large (the pixels of an image), using convolutional layers for the first layers is recommended because they have fewer parameters than fully-connected layers and you don't end up with billions of parameters for the first layer (which would make your network prone to overfitting).
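
As a rough back-of-the-envelope illustration of the parameter argument (a minimal sketch; the 224x224 input resolution, the 4096 hidden units and the 64 3x3 filters are made-up example numbers, not taken from any particular network):

    # Rough parameter count for the *first* layer of a vision network.
    h, w, c = 224, 224, 3               # illustrative input image: height, width, channels
    n_inputs = h * w * c                # 150,528 input values

    # Fully-connected first layer with 4096 hidden units:
    fc_params = n_inputs * 4096 + 4096  # weights + biases, ~617 million
    print(f"fully connected: {fc_params:,} parameters")

    # Convolutional first layer with 64 filters of size 3x3 over 3 channels:
    conv_params = 64 * (3 * 3 * c) + 64 # weights + biases, 1,792
    print(f"convolutional:   {conv_params:,} parameters")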

However, it has been shown (for example in this paper) for several tasks that using rectified linear units alleviates the problem of vanishing gradients (as opposed to conventional sigmoid functions).
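
To illustrate, here is a minimal NumPy sketch (with arbitrary depth, width and weight initialization) that backpropagates a unit gradient through a deep fully-connected stack and compares the resulting first-layer gradient norm for sigmoid versus ReLU; the ReLU gradient typically comes out many orders of magnitude larger:

    import numpy as np

    rng = np.random.default_rng(0)
    depth, width = 30, 64               # arbitrary choices for illustration

    def first_layer_grad_norm(act, act_grad):
        """Tiny manual forward/backward pass through `depth` dense layers."""
        Ws = [rng.normal(0, 1 / np.sqrt(width), (width, width)) for _ in range(depth)]
        a = rng.normal(size=width)
        pre = []
        for W in Ws:                    # forward pass, storing pre-activations
            z = W @ a
            pre.append(z)
            a = act(z)
        g = np.ones(width)              # backpropagate a unit gradient from the top
        for W, z in zip(reversed(Ws), reversed(pre)):
            g = (g * act_grad(z)) @ W   # chain rule: W^T (g * f'(z))
        return np.linalg.norm(g)

    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    print("sigmoid:", first_layer_grad_norm(sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))))
    print("relu:   ", first_layer_grad_norm(lambda z: np.maximum(z, 0),
                                            lambda z: (z > 0).astype(float)))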

Electrodynamics answered 10/3, 2015 at 0:29 Comment(4)
Yes, I was reading somewhere else that Rectified Linear Units are free from the vanishing gradient problem. I know that autoencoders and Boltzmann machines are trained in a greedy layerwise manner. Is the same ever done for convolutional neural networks?Weise
ReLUs are not totally free from the vanishing gradient problem, but they suffer from it less. It is possible to perform greedy layerwise pre-training on convolutional networks too; it may be unsupervised, as with autoencoders, or supervised, when you connect the layer to the outputs. I believe in this paper they did supervised pre-training: cs.toronto.edu/~fritz/absps/imagenet.pdfElectrodynamics
I read the paper (in my last comment) again. It was not clear that they used greedy layerwise pre-training. They just say pre-training. I do not have other references for layerwise training on convolutional networks at the moment, but it is possible to do that.Electrodynamics
@Weise Here is an excellent explanation of why other activation functions, such as the sigmoid function, cause vanishing gradients. There's just the right amount of math in there to make you understand the true reason.Kolk

Recent advances have alleviated the effects of vanishing gradients in deep neural networks. Contributing advances include:

  1. Usage of GPU for training deep neural networks
  2. Usage of better activation functions. (At this point, rectified linear units (ReLU) seem to work best.)

With these advances, deep neural networks can be trained even without layerwise pretraining.

Source: http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/

Turrell answered 8/1, 2016 at 4:3 Comment(4)
this is irrelevant to the problem: "Usage of GPU for training deep neural networks"Yila
if you train the CNN using the GPU, then you'd be less affected by the vanishing gradient. Hope it's clearTurrell
well actually, I think a more proper way to say it is that by using a GPU you can afford to use a smaller learning rate (computing time won't be an issue), and that somewhat reduces the risk of vanishing.Apocarp
@BsHe this makes more sense than what dnth saidEphebe

We do not use sigmoid and tanh as activation functions, since they cause vanishing gradient problems. Nowadays we mostly use ReLU-based activation functions when training a deep neural network model to avoid such complications and improve accuracy.

It's because the gradient (slope) of the ReLU activation is 1 when its input is above 0. The sigmoid derivative has a maximum slope of 0.25, which means that during the backward pass you are multiplying gradients by values less than 1; the more layers you have, the more such values you multiply together, making the gradients smaller and smaller. ReLU avoids this by having a gradient slope of 1, so during backpropagation the gradients passed back do not get progressively smaller; they stay the same, which is how ReLU counters the vanishing gradient problem.
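
To put rough numbers on that multiplication (a toy calculation that ignores the weight matrices and assumes each layer contributes the maximum sigmoid derivative of 0.25, versus a ReLU derivative of 1):

    # Toy upper bound on the product of activation derivatives accumulated
    # over `depth` layers during backpropagation (weight matrices ignored).
    for depth in (5, 10, 20, 50):
        sigmoid_factor = 0.25 ** depth      # sigmoid'(z) <= 0.25
        relu_factor = 1.0 ** depth          # relu'(z) = 1 for z > 0
        print(f"depth={depth:2d}  sigmoid <= {sigmoid_factor:.2e}  relu = {relu_factor:.0f}")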

One thing to note about ReLU, however, is that if a neuron's input is less than 0, that neuron is dead and the gradient passed back through it is 0, meaning it contributes nothing during backpropagation.

An alternative is Leaky RELU, which gives some gradient for values less than 0.
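
A minimal sketch of that difference in NumPy (the 0.01 negative slope below is just a common default; it is a tunable hyperparameter):

    import numpy as np

    def relu_grad(z):
        # ReLU gradient: 1 where z > 0, exactly 0 otherwise ("dead" units pass nothing back).
        return (z > 0).astype(float)

    def leaky_relu_grad(z, negative_slope=0.01):
        # Leaky ReLU still passes a small gradient back for z <= 0.
        return np.where(z > 0, 1.0, negative_slope)

    z = np.array([-2.0, -0.5, 0.5, 2.0])                # example pre-activations
    print("ReLU gradients:      ", relu_grad(z))        # [0.   0.   1.   1.  ]
    print("Leaky ReLU gradients:", leaky_relu_grad(z))  # [0.01 0.01 1.   1.  ]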

Meagan answered 22/3, 2020 at 20:6 Comment(0)

The first answer is from 2015 and showing its age a bit.

Today, CNNs typically also use batch normalization. While there is some debate about why this helps, the inventors mention covariate shift (https://arxiv.org/abs/1502.03167), and there are other theories, such as smoothing of the loss landscape (https://arxiv.org/abs/1805.11604).

Either way, it is a method that helps significantly with the vanishing/exploding gradient problem, which is also relevant for CNNs. In CNNs you also apply the chain rule to get gradients, so the update of the first layer is proportional to a product of N numbers, where N is the number of layers. It is very likely that this product is either relatively big or small compared to the update of the last layer. This can be seen by looking at the variance of a product of random variables, which quickly grows the more variables are multiplied: https://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables
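
A quick simulation of that effect (a sketch with arbitrarily chosen factors of mean 1 and standard deviation 0.5; for independent factors with mean 1 and variance s^2, the variance of the product is (1 + s^2)^N - 1, so it grows quickly with N):

    import numpy as np

    # Empirical variance of a product of N independent factors, each with
    # mean 1 and standard deviation 0.5 (arbitrary values for illustration).
    rng = np.random.default_rng(0)
    samples = 100_000
    for n_factors in (1, 5, 10, 20):
        factors = rng.normal(loc=1.0, scale=0.5, size=(samples, n_factors))
        variance = factors.prod(axis=1).var()
        print(f"N={n_factors:2d}  var(product) ~ {variance:.1f}")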

For recurrent networks with long input sequences, i.e. of length L, the situation is often worse than for CNNs, since there the product consists of L numbers. Often the sequence length L in an RNN is much larger than the number of layers N in a CNN.

Mimi answered 14/12, 2020 at 20:49 Comment(0)
