Backpropagation activation derivative

Asked 6/10, 2015 at 6:47 Answered 17/1, 2019 at 0:44

Solved backpropagation activation derivative delta

I've implemented backpropagation as explained in this video. https://class.coursera.org/ml-005/lecture/51

This seems to have worked successfully, passing gradient checking and allowing me to train on MNIST digits.

However, I've noticed most other explanations of backpropagation calculate the output delta as

d = (a - y) * f'(z) http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

whilst the video uses

d = (a - y)

When I multiply my delta by the activation derivative (sigmoid derivative), I no longer end up with the same gradients as gradient checking (at least an order of magnitude in difference).

What allows Andrew Ng (in the video) to leave out the derivative of the activation for the output delta, and why does it work? And why are incorrect gradients calculated when I add the derivative?

EDIT

I have now tested with linear and sigmoid activation functions on the output. Gradient checking only passes when I use Ng's delta equation (no sigmoid derivative) in both cases.

Conferral answered 6/10, 2015 at 6:47 Comment(1)

Please let me know if my question is not making sense. – Conferral 9/10, 2015 at 3:11

Found my answer here. The output delta does require multiplication by the derivative of the activation, as in:

d = (a - y) * g'(z)

However, Ng uses the cross-entropy cost function, whose derivative with respect to the output activation contains a factor that cancels g'(z) for a sigmoid output, leaving the d = a - y calculation shown in the video. If a mean squared error cost function is used instead, the derivative of the activation function must be present.
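The cancellation can be checked numerically. The following is a minimal sketch for a single sigmoid output unit, not the code from the video or the assignment; the helper names (`numerical_delta`, `squared_error`) and the values of `z` and `y` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(a, y):
    # Per-example cross-entropy cost for one sigmoid output unit.
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def squared_error(a, y):
    # Per-example squared-error cost.
    return 0.5 * (a - y) ** 2

def numerical_delta(cost, z, y, eps=1e-6):
    """Central-difference estimate of d cost(sigmoid(z), y) / dz."""
    return (cost(sigmoid(z + eps), y) - cost(sigmoid(z - eps), y)) / (2 * eps)

z, y = 0.5, 1.0  # hypothetical pre-activation and target
a = sigmoid(z)

delta_ce = a - y                     # cross-entropy: g'(z) cancels
delta_mse = (a - y) * a * (1 - a)    # squared error: g'(z) remains

ce_error = abs(delta_ce - numerical_delta(cross_entropy, z, y))
mse_error = abs(delta_mse - numerical_delta(squared_error, z, y))
# Using the squared-error delta with a cross-entropy cost does not match.
mismatch = abs(delta_mse - numerical_delta(cross_entropy, z, y))
print(ce_error, mse_error, mismatch)
```

Each delta agrees with the numerical gradient of its own cost. Mixing them, as in the question, gives a clear mismatch.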

Conferral answered 9/10, 2015 at 4:29 Comment(1)

I had the same doubt (I'm also following his videos), thanks for clarifying!! Although I also have another problem: I've checked my implementation with gradient checking and it's almost the same output. However, I'm getting pretty bad results (50% accuracy identifying digits). But if I remove the sigmoid derivative from calculation of inner-deltas, I get an accuracy of 90% (but obviously my gradients are no longer the same as the gradient checking). Do you have any idea of why this happens? – Antiseptic 20/1, 2018 at 21:22

When using neural networks, how you design the network depends on the learning task. A common approach for regression tasks is to use tanh() activation functions for the input and all hidden layers and a linear activation function for the output layer (img taken from here).

I did not find the source, but there was a theorem which states that using non-linear together with linear activation functions allows you to better approximate the target functions. An example of using different activation functions can be found here and here.

There are many different kinds of activation functions which can be used (img taken from here). If you look at the derivatives, you can see that the derivative of the linear function equals 1, so it is not mentioned any further. This is also the case in Ng's explanation: if you look at minute 12 in the video, you see that he is talking about the output layer.

Concerning the Backpropagation-Algorithm

"When neuron is located in the output layer of the network, it is supplied with a desired response of its own. We may use e(n) = d(n) - y(n) to compute the error signal e(n) associated with this neuron; see Fig. 4.3. Having determined e(n), we find it a straightforward matter to compute the local gradient [...] When neuron is located in a hidden layer of the network, there is no specified desired response for that neuron. Accordingly, the error signal for a hidden neuron would have to be determined recursively and working backwards in terms of the error signals of all the neurons to which that hidden neuron is directly connected"

Haykin, Simon S., et al. Neural networks and learning machines. Vol. 3. Upper Saddle River: Pearson Education, 2009. p 159-164
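The recursive rule for hidden neurons in the quoted passage can be sketched as follows. This is an illustrative 2-2-1 sigmoid network with a cross-entropy cost, not Haykin's notation or code; the weights, input, target and function names are hypothetical, and biases are omitted for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """Hidden activations and output of a 2-2-1 sigmoid network (no biases)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)))
    return hidden, output

def cost(W1, W2, x, y):
    _, a = forward(W1, W2, x)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Hypothetical weights, input and target.
W1 = [[0.1, -0.4], [0.7, 0.2]]
W2 = [0.5, -0.3]
x, y = [1.0, 2.0], 1.0

hidden, a = forward(W1, W2, x)
delta_out = a - y  # output delta under cross-entropy
# Hidden deltas: the output error sent back through W2, times the
# sigmoid derivative of each hidden unit.
delta_hidden = [W2[j] * delta_out * hidden[j] * (1 - hidden[j]) for j in range(2)]
grad_W1 = [[delta_hidden[j] * x[i] for i in range(2)] for j in range(2)]

def numerical_grad_W1(eps=1e-6):
    """Central-difference gradient of the cost with respect to W1."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        for i in range(2):
            original = W1[j][i]
            W1[j][i] = original + eps
            plus = cost(W1, W2, x, y)
            W1[j][i] = original - eps
            minus = cost(W1, W2, x, y)
            W1[j][i] = original
            grad[j][i] = (plus - minus) / (2 * eps)
    return grad

numeric = numerical_grad_W1()
max_error = max(abs(grad_W1[j][i] - numeric[j][i])
                for j in range(2) for i in range(2))
print(max_error)
```

In this sketch the hidden deltas keep the activation derivative even though the output delta does not, and the result agrees with the numerical gradient.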

Manipulate answered 6/10, 2015 at 7:47 Comment(5)

Do you mean the 2 minute mark? This would make sense if Ng (and myself) were using a linear activation at the output, but in the video, the output activation is being calculated as h = a = g(z) where g is the same sigmoid (logistic) function used for the input and hidden layers. – Conferral 6/10, 2015 at 11:27

The g(...) just stands for a neuron's activation function (according to the general delta-rule definition); it does not say what type it is. If it was specified before (I didn't watch the other videos), maybe it was said that a regression task should be solved, so using sigmoid and linear output is a common approach. – Manipulate 6/10, 2015 at 14:58

He's definitely using sigmoid, you can see him mentioning it here: class.coursera.org/ml-005/lecture/47 (at 4min) And it is asked for in his programming assignment which I used to create my neural network. – Conferral 6/10, 2015 at 15:29

You're right about that, but the output layer is treated differently than the other layers. I looked it up in the Haykin09 book (which I only have at home) and updated my answer accordingly. If you google for the book, you'll easily find an online version. – Manipulate 6/10, 2015 at 19:28

Thanks for finding this resource, however, that excerpt is in regards to a network with linear output (not sigmoid). The delta calculation for a sigmoid output is mentioned next. imgur.com/OTE1yFR – Conferral 7/10, 2015 at 2:54

Here is a link with an explanation of all the intuition and math behind backpropagation.

Andrew Ng uses the cross-entropy cost function, defined (without the regularization term) as: $J(\theta) = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K\left[y\log(h_{\theta}(x))+\left(1-y\right)\log\left(1-h_{\theta}(x)\right)\right]$

Differentiating the bracketed term with respect to a parameter θ in the last layer gives:

$\left(y\log(h_{\theta}(x))+\left(1-y\right)\log\left(1-h_{\theta}(x)\right)\right)'$

$=\left(y\log(h_{\theta}(x))\right)'+\left(\left(1-y\right)\log\left(1-h_{\theta}(x)\right)\right)'$

$=\frac{y}{h_{\theta}(x)} (h_{\theta}(x))'+\frac{1-y}{1-h_{\theta}(x)} (1-h_{\theta}(x))'$

$=\frac{y}{h_{\theta}(x)} (\sigma(z^{(L)}))' (z^{(L)})'+\frac{1-y}{1-h_{\theta}(x)} (-(\sigma(z^{(L)}))'(z^{(L)})')$

$=\left(\frac{y}{h_{\theta}(x)}-\frac{(1-y)}{1-h_{\theta}(x)} \right)(\sigma(z^{(L)}))'(z^{(L)})'$

Substituting the derivative of σ(z), derived at the end of this post, gives:

$=\left(\frac{y}{h_{\theta}(x)}-\frac{(1-y)}{1-h_{\theta}(x)} \right)\sigma(z^{(L)}) (1-\sigma(z^{(L)}))(z^{(L)})'$

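For the last layer L we have $a^{(L)} = \sigma(z^{(L)}) = h_\theta(x)$, so:

$=\left(\frac{y}{h_{\theta}(x)}-\frac{(1-y)}{1-h_{\theta}(x)} \right)h_\theta(x)(1-h_\theta(x)) (z^{(L)})'$

Multiplying out:

$=\left(y(1-h_\theta(x))-(1-y)h_\theta(x) \right)(z^{(L)})'$

$=\left(y-h_\theta(x) \right)(z^{(L)})'$

Ng's cost is the negative of this bracketed sum, so the gradient of the cost is $\left(h_\theta(x)-y\right)(z^{(L)})'$: the output delta is $a^{(L)}-y$, with no remaining $\sigma'$ factor.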
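For contrast, here is a sketch of the same step under a squared-error cost. It is not part of Ng's derivation; the per-example cost $\frac{1}{2}\left(h_\theta(x)-y\right)^2$ is an assumption chosen for illustration.

$\left(\frac{1}{2}\left(h_\theta(x)-y\right)^2\right)' = \left(h_\theta(x)-y\right)(h_\theta(x))' = \left(h_\theta(x)-y\right)\sigma(z^{(L)})\left(1-\sigma(z^{(L)})\right)(z^{(L)})'$

Here nothing cancels the $\sigma(z^{(L)})(1-\sigma(z^{(L)}))$ factor, so the output delta keeps the activation derivative: d = (a - y) * g'(z).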

The derivative of σ(z) follows from the quotient rule:

$\sigma'(z) = \left(\frac{1}{1+e^{(-z)}}\right)' = \frac{1'(1+e^{(-z)})-1(1+e^{(-z)})'}{(1+e^{(-z)})^{2}} = \frac{e^{(-z)}}{(1+e^{(-z)})^{2}}$

$=\frac{e^{(-z)}+1-1}{(1+e^{(-z)})^{2}}=\frac{1}{1+e^{(-z)}} - \left(\frac{1}{1+e^{(-z)}}\right)^{2} = \sigma(z) (1-\sigma(z))$
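The closed form σ(z)(1 − σ(z)) can be sanity-checked against a finite-difference estimate. This is a small sketch; the function names and sample points are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Closed form derived above: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, eps=1e-6):
    # Central-difference approximation of f'(z).
    return (f(z + eps) - f(z - eps)) / (2 * eps)

points = [-3.0, -0.5, 0.0, 1.0, 4.0]  # hypothetical sample points
max_error = max(abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z))
                for z in points)
print(max_error)
```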

Marra answered 17/1, 2019 at 0:44 Comment(0)
