Derivative of a softmax function explanation [closed]
I am trying to compute the derivative of the softmax activation function. I found this: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function, but nobody there seems to give the proper derivation of the answers for the i = j and i != j cases. Could someone please explain this? I am confused about taking derivatives when a summation is involved, as in the denominator of the softmax activation function.
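For reference, a minimal NumPy sketch of the softmax activation itself (the max subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax(o):
    # p_j = exp(o_j) / Sum_k(exp(o_k)); subtracting max(o) avoids overflow
    e = np.exp(o - np.max(o))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# the outputs are positive and sum to 1
```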

Kirkkirkcaldy answered 13/6, 2016 at 13:24 Comment(5)
I'm voting to close this question as off-topic because it has nothing to do with programmingMazurka
yes it does. There is a thing called the softmax function in neural networks and although one can use libraries, knowing the underlying math is an advantage. @MazurkaDrona
@Drona we have no less than 3 (!) dedicated SE sites for such non-programming ML questions, which are off-topic here; please see the intro and NOTE in stackoverflow.com/tags/machine-learning/infoMazurka
I’m voting to close this question because it is not about programming as defined in the help center but about ML theory and/or methodology - please see the note in stackoverflow.com/tags/neural-network/infoMazurka
@Drona and sincere thanks for the mini-lecture on softmax and libraries, but I think I got this #34969222Mazurka

The derivative of a sum is the sum of the derivatives, ie:

    d(f1 + f2 + f3 + f4)/dx = df1/dx + df2/dx + df3/dx + df4/dx

To find the derivative of p_j with respect to o_i, we start with:

    d_i(p_j) = d_i(exp(o_j) / Sum_k(exp(o_k)))

I decided to use d_i for the derivative with respect to o_i to make this easier to read. Using the product rule we get:

    d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k)))

Looking at the first term, the derivative will be 0 if i != j, this can be represented with a delta function which I will call D_ij. This gives (for the first term):

    = D_ij * exp(o_j) / Sum_k(exp(o_k))

Which is just our original function multiplied by D_ij

    = D_ij * p_j

For the second term, when we differentiate each element of the sum individually, the only non-zero term is the one where i = k. This gives us (not forgetting the power rule, because the sum is in the denominator):

    = -exp(o_j) * Sum_k(d_i(exp(o_k))) / Sum_k(exp(o_k))^2
    = -exp(o_j) * exp(o_i) / Sum_k(exp(o_k))^2
    = -(exp(o_j) / Sum_k(exp(o_k))) * (exp(o_i) / Sum_k(exp(o_k)))
    = -p_j * p_i

Putting the two together we get the surprisingly simple formula:

    D_ij * p_j - p_j * p_i

If you really want we can split it into i = j and i != j cases:

    i = j: D_ii * p_i - p_i * p_i = p_i - p_i * p_i = p_i * (1 - p_i)

    i != j: D_ij * p_j - p_j * p_i = -p_j * p_i

Which is our answer.
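The formula D_ij * p_j - p_j * p_i can be checked numerically against finite differences (a NumPy sketch; the function names are mine, not from the question):

```python
import numpy as np

def softmax(o):
    # p_j = exp(o_j) / Sum_k(exp(o_k)), shifted by max(o) for stability
    e = np.exp(o - np.max(o))
    return e / e.sum()

def softmax_jacobian(o):
    # J[i, j] = d(p_j)/d(o_i) = D_ij * p_j - p_j * p_i
    p = softmax(o)
    return np.diag(p) - np.outer(p, p)

o = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(o)

# Compare each row against a central finite difference in o_i
eps = 1e-6
J_num = np.zeros((3, 3))
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    J_num[i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)
```

Each row of the Jacobian also sums to zero, which is expected: the softmax outputs always sum to 1, so any change in one output must be compensated by the others.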

Tarango answered 13/6, 2016 at 13:55 Comment(11)
Thank you so much! This is so clear; I couldn't have asked for a better explanation! :) I am glad I understand the derivation completely now. I am going to link this from the unanswered one on math.stackexchange!Kirkkirkcaldy
@Tarango shouldn't your third expression be d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k))) ? Missing exp before the last o_kThieve
@BenjaminCrouzier Thanks, fixed itTarango
> Looking at the first term, the derivative will be 0 if i != j

Why is this the case? The output o_i (i.e., a particular node of softmax) depends on all the values from the incoming layer. Won't this mean that if i != j, the values will be different from i = k but not 0? See eli.thegreenplace.net/2016/…Aretha
@Aretha Each variable is independent, so the partial derivative says exactly that: it is 0. When taking the partial derivative with respect to one variable, you treat every other variable as constant.Tarango
@Aretha The derivative doesn't look at what the output is, it looks at how that output changes when you vary just one variable, this is why all other variables are treated as constant. I hope this clears this up a bit.Tarango
Please see the question that I have posted in detail: math.stackexchange.com/questions/2843505/… My derivative came out to be non-zero for the case where you have 0. Am I mistaken?Aretha
@Aretha First, in the question you link to, you incorrectly say that you add up the elements of the Jacobian to get the 'final' derivative. This is incorrect; think instead of the Jacobian as being the derivative, and not an intermediate step that leads to the derivative.Tarango
@Aretha in my solution the i and j refer to the elements of the Jacobian matrix. you seem to think that the 'thing' that goes to 0 is the derivative, but it's just one part of the partial derivative. You wrote out each derivative manually (for 4 inputs) whereas I treated the general case.Tarango
@Aretha The thing that went to 0 was the subexpression d_i(exp(o_j)), which is part of the subexpression d_i(exp(o_j)) / Sum_k(exp(o_k)). Look carefully at the parentheses and you will see that this is the derivative of exp(o_j) with respect to o_i, divided by the sum over k of exp(o_k). The derivative of Sum_k(exp(o_k)) with respect to o_i is taken care of in the second part of the product rule expansion. Does this help clear things up?Tarango
It does. I think a detailed answer to my question would be of great help to others too :)Aretha

For what it's worth, here is my derivation based on SirGuy's answer (feel free to point out errors if you find any):

[Image: handwritten step-by-step derivation of the softmax derivative]

Thieve answered 31/10, 2017 at 14:48 Comment(2)
thanks very much for this! I have just one doubt: why does Σ_k ( ( d e^{o_k} ) / do_i ) evaluate to e^{o_i} from step 4 to 5? I'd be very grateful for any insights you can offer on that question.Monopoly
@Monopoly Good question. Think about all the terms of that sum one by one and see what happens to each term. You see that you have two cases: When i = k, the term is d/do_i e^o_i which is e^o_i. When i != k, you get a bunch of zeroes.Thieve
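The step questioned above, Σ_k ( d e^{o_k} / do_i ) = e^{o_i}, can also be checked numerically (a NumPy sketch with a central finite difference):

```python
import numpy as np

o = np.array([0.3, -0.7, 1.2])
i = 0  # differentiate with respect to o_i

# Sum_k(exp(o_k)) as a function of the whole vector o
S = lambda v: np.exp(v).sum()

# Central finite difference of S in the o_i direction
eps = 1e-6
d = np.zeros_like(o)
d[i] = eps
dS = (S(o + d) - S(o - d)) / (2 * eps)

# Only the k = i term of the sum depends on o_i, so dS/do_i = exp(o_i)
```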
