Binary classification with Softmax

I am training a binary classifier using a sigmoid activation function with binary_crossentropy, which gives good accuracy, around 98%.
When I train the same model using softmax with categorical_crossentropy, the accuracy is very low (< 40%).
I am passing the targets for binary_crossentropy as a list of 0s and 1s, e.g. [0,1,1,1,0].

Any idea why this is happening?

This is the model I am using for the second classifier: [screenshot of the model code]

Spate asked 21/8, 2017 at 9:38 Comment(1)
Could you please show us the code you used? Maybe the answer lies hidden somewhere in your description. My guess would be that there are effectively more than 2 classes in your second classifier, as 40% accuracy is even worse than a random binary classifier. Pithecanthropus

Right now, your second model always answers "Class 0", because it has only one class to choose from (the number of outputs of your last layer).

As you have two classes, you need to compute the softmax + categorical_crossentropy over two outputs, then pick the most probable one.

Hence, your last layer should be:

model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', ...)
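
Note also that categorical_crossentropy expects one-hot targets rather than a flat list of 0s and 1s (the asker confirms this in the comments below). A minimal sketch of the conversion, using Keras's to_categorical utility:

import numpy as np
from tensorflow.keras.utils import to_categorical

y = np.array([0, 1, 1, 1, 0])           # flat binary labels, as in the question
y_onehot = to_categorical(y, num_classes=2)
print(y_onehot)                          # 0 -> [1., 0.] and 1 -> [0., 1.]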

Your sigmoid + binary_crossentropy model, which computes the probability of "Class 0" being True from a single output number, is already correct.
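
For concreteness, here is a minimal sketch of both correct setups side by side. The hidden-layer size, input_dim, and optimizer are placeholders, not taken from the original post:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Option 1: one output, sigmoid + binary_crossentropy
# (targets: flat 0/1 labels, e.g. [0, 1, 1, 1, 0])
m1 = Sequential()
m1.add(Dense(16, activation='relu', input_dim=20))
m1.add(Dense(1, activation='sigmoid'))
m1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Option 2: two outputs, softmax + categorical_crossentropy
# (targets: one-hot rows, e.g. [[1, 0], [0, 1], ...])
m2 = Sequential()
m2.add(Dense(16, activation='relu', input_dim=20))
m2.add(Dense(2, activation='softmax'))
m2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])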

EDIT: Here is a small explanation of the Sigmoid function

Sigmoid can be viewed as a mapping from the space of real numbers to a probability space (the open interval (0, 1)).

Sigmoid(x) = 1 / (1 + exp(-x))

Notice that:

Sigmoid(-infinity) = 0   
Sigmoid(0) = 0.5   
Sigmoid(+infinity) = 1   
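
A quick numerical check of these limits (a NumPy sketch, not part of the original answer):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-20.0, 0.0, 20.0])))
# approximately [0.  0.5  1.] -- the extremes approach but never reach 0 and 1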

So if the real-number output of your network is very low, the sigmoid will decide the probability of "Class 0" is close to 0, and decide "Class 1".
Conversely, if the output of your network is very high, the sigmoid will decide the probability of "Class 0" is close to 1, and decide "Class 0".

Its decision is similar to deciding the class only by looking at the sign of your output. However, that alone would not allow your model to learn! The gradient of this hard binary loss is null almost everywhere, which makes it impossible for your model to learn from its errors, as they are not quantified properly.

That's why sigmoid and binary_crossentropy are used:
they are a surrogate for the hard binary loss, with nice smooth properties, and they enable learning.
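
To make the surrogate idea concrete, a small sketch, assuming a single raw score z and the usual y in {0, 1} label convention (neither of which is spelled out in the original answer):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 2.0        # raw score output by the network (placeholder value)
y = 1.0        # true label

p = sigmoid(z)                                        # predicted probability
bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))  # binary cross-entropy
grad = p - y   # d(bce)/dz: nonzero whenever p != y, so learning can proceed

# By contrast, the hard 0/1 loss (classify by the sign of z) is a step
# function of z, so its gradient is zero almost everywhere and gives
# no learning signal.
print(bce, grad)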

Also, please find more info about the Softmax Function and Cross Entropy.

Pithecanthropus answered 21/8, 2017 at 10:11 Comment(3)
I now understand the logic. But how did sigmoid work with just one output? Spate
@AKSHAYAAVAIDYANATHAN I just edited my post, I hope this helps! Pithecanthropus
I also realized that the targets should be in the format [[0,1], [1,0]] for categorical_crossentropy, rather than just a list of 1s and 0s. Spate
