Derivative of a softmax function explanation [closed]
I am trying to compute the derivative of the softmax activation function. I found this: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function, but nobody there seems to give the proper derivation of the answers for the i = j and i != j cases. Could someone please explain this? I am confused about taking derivatives when a summation is involved, as in the denominator of the softmax activation function.
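For reference, a minimal NumPy sketch of the softmax activation itself (the max subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax(o):
    # p_j = exp(o_j) / Sum_k(exp(o_k)); subtracting max(o) avoids overflow
    e = np.exp(o - np.max(o))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# the outputs are positive and sum to 1
```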

Kirkkirkcaldy answered 13/6, 2016 at 13:24 Comment(5)
I'm voting to close this question as off-topic because it has nothing to do with programmingMazurka
yes it does. There is a thing called the softmax function in neural networks and although one can use libraries, knowing the underlying math is an advantage. @MazurkaDrona
@Drona we have no less than 3 (!) dedicated SE sites for such non-programming ML questions, which are off-topic here; please see the intro and NOTE in stackoverflow.com/tags/machine-learning/infoMazurka
I’m voting to close this question because it is not about programming as defined in the help center but about ML theory and/or methodology - please see the note in stackoverflow.com/tags/neural-network/infoMazurka
@Drona and sincere thanks for the mini-lecture on softmax and libraries, but I think I got this #34969222Mazurka

The derivative of a sum is the sum of the derivatives, ie:

    d(f1 + f2 + f3 + f4)/dx = df1/dx + df2/dx + df3/dx + df4/dx

To find the derivative of p_j with respect to o_i, we start with:

    d_i(p_j) = d_i(exp(o_j) / Sum_k(exp(o_k)))

I decided to use d_i for the derivative with respect to o_i to make this easier to read. Using the product rule we get:

    d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k)))

Looking at the first term, the derivative will be 0 if i != j, this can be represented with a delta function which I will call D_ij. This gives (for the first term):

    = D_ij * exp(o_j) / Sum_k(exp(o_k))

Which is just our original function multiplied by D_ij

    = D_ij * p_j

For the second term, when we differentiate each element of the sum individually, the only non-zero term is the one where i = k. This gives us (not forgetting the power rule, because the sum is in the denominator):

    = -exp(o_j) * Sum_k(d_i(exp(o_k))) / Sum_k(exp(o_k))^2
    = -exp(o_j) * exp(o_i) / Sum_k(exp(o_k))^2
    = -(exp(o_j) / Sum_k(exp(o_k))) * (exp(o_i) / Sum_k(exp(o_k)))
    = -p_j * p_i

Putting the two together we get the surprisingly simple formula:

    D_ij * p_j - p_j * p_i

If you really want we can split it into i = j and i != j cases:

    i = j: D_ii * p_i - p_i * p_i = p_i - p_i * p_i = p_i * (1 - p_i)

    i != j: D_ij * p_j - p_j * p_i = -p_j * p_i

Which is our answer.
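The formula D_ij * p_j - p_j * p_i can be checked numerically against finite differences (a NumPy sketch; the function names are mine, not from the question):

```python
import numpy as np

def softmax(o):
    # p_j = exp(o_j) / Sum_k(exp(o_k)), shifted by max(o) for stability
    e = np.exp(o - np.max(o))
    return e / e.sum()

def softmax_jacobian(o):
    # J[i, j] = d(p_j)/d(o_i) = D_ij * p_j - p_j * p_i
    p = softmax(o)
    return np.diag(p) - np.outer(p, p)

o = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(o)

# Compare each row against a central finite difference in o_i
eps = 1e-6
J_num = np.zeros((3, 3))
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    J_num[i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)
```

Each row of the Jacobian also sums to zero, which is expected: the softmax outputs always sum to 1, so any change in one output must be compensated by the others.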

Tarango answered 13/6, 2016 at 13:55 Comment(11)
Thank you so much! This is so clear; I couldn't have asked for a better explanation! :) I am glad I understand the derivation completely now. I am going to link this from the unanswered one on math.stackexchange!Kirkkirkcaldy
@Tarango shouldn't your third expression be d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k))) ? Missing exp before the last o_kThieve
@BenjaminCrouzier Thanks, fixed itTarango
> Looking at the first term, the derivative will be 0 if i != j

Why is this the case? The output o_i (i.e., a particular node of softmax) depends on all the values from the incoming layer. Won't this mean that if i != j, the values will be different from i = k but not 0? See eli.thegreenplace.net/2016/…Aretha
@Aretha Each variable is independent, so the partial derivative says exactly that: it is 0. When taking the partial derivative with respect to one variable, you treat every other variable as constant.Tarango
@Aretha The derivative doesn't look at what the output is, it looks at how that output changes when you vary just one variable, this is why all other variables are treated as constant. I hope this clears this up a bit.Tarango
Please see the question that I have posted in detail: math.stackexchange.com/questions/2843505/… My derivative came out to be non-zero for the case where you have 0. Am I mistaken?Aretha
@Aretha First, in the question you link to, you incorrectly say that you add up the elements of the Jacobian to get the 'final' derivative. This is incorrect; think instead of the Jacobian as being the derivative, and not an intermediate step that leads to the derivative.Tarango
@Aretha in my solution the i and j refer to the elements of the Jacobian matrix. you seem to think that the 'thing' that goes to 0 is the derivative, but it's just one part of the partial derivative. You wrote out each derivative manually (for 4 inputs) whereas I treated the general case.Tarango
@Aretha The thing that went to 0 was the subexpression d_i(exp(o_j)), which is part of the subexpression d_i(exp(o_j)) / Sum_k(exp(o_k)). Look carefully at the parentheses and you will see that this is the derivative of exp(o_j) with respect to o_i, divided by the sum over k of exp(o_k). The derivative of Sum_k(exp(o_k)) with respect to o_i is taken care of in the second part of the product rule expansion. Does this help clear things up?Tarango
It does. I think a detailed answer to my question would be of great help to others too :)Aretha

For what it's worth, here is my derivation based on SirGuy's answer (feel free to point out errors if you find any):

[Image: handwritten step-by-step derivation of the softmax derivative]

Thieve answered 31/10, 2017 at 14:48 Comment(2)
thanks very much for this! I have just one doubt: why does Σ_k ( ( d e^{o_k} ) / do_i ) evaluate to e^{o_i} from step 4 to 5? I'd be very grateful for any insights you can offer on that question.Monopoly
@Monopoly Good question. Think about all the terms of that sum one by one and see what happens to each term. You see that you have two cases: When i = k, the term is d/do_i e^o_i which is e^o_i. When i != k, you get a bunch of zeroes.Thieve
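The step questioned above, Σ_k ( d e^{o_k} / do_i ) = e^{o_i}, can also be checked numerically (a NumPy sketch with a central finite difference):

```python
import numpy as np

o = np.array([0.3, -0.7, 1.2])
i = 0  # differentiate with respect to o_i

# Sum_k(exp(o_k)) as a function of the whole vector o
S = lambda v: np.exp(v).sum()

# Central finite difference of S in the o_i direction
eps = 1e-6
d = np.zeros_like(o)
d[i] = eps
dS = (S(o + d) - S(o - d)) / (2 * eps)

# Only the k = i term of the sum depends on o_i, so dS/do_i = exp(o_i)
```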
