Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

Asked 7/2, 2017 at 3:34 Answered 16/3, 2021 at 14:7

Solved machine-learning keras neural-network deep-learning conv-neural-network

218

I'm trying to train a CNN to categorize text by topic. When I use binary cross-entropy I get ~80% accuracy; with categorical cross-entropy I get ~50% accuracy.

I don't understand why this is. It's a multiclass problem; doesn't that mean that I have to use categorical cross-entropy and that the results with binary cross-entropy are meaningless?

model.add(embedding_layer)
model.add(Dropout(0.25))
# convolution layers
model.add(Conv1D(nb_filter=32,
                    filter_length=4,
                    border_mode='valid',
                    activation='relu'))
model.add(MaxPooling1D(pool_length=2))
# dense layers
model.add(Flatten())
model.add(Dense(256))
model.add(Dropout(0.25))
model.add(Activation('relu'))
# output layer
model.add(Dense(len(class_id_index)))
model.add(Activation('softmax'))

Then I compile it either like this, using categorical_crossentropy as the loss function:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

or like this, using binary_crossentropy:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Intuitively it makes sense why I'd want to use categorical cross-entropy, but I don't understand why I get good results with binary and poor results with categorical.

Powder answered 7/2, 2017 at 3:34 Comment(10)

If it is a multiclass problem, you have to use categorical_crossentropy. Also, labels need to be converted into the categorical format. See to_categorical to do this. Also see definitions of categorical and binary crossentropies here. – Slumber 7/2, 2017 at 3:42

My labels are categorical, created using to_categorical (one hot vectors for each class). Does that mean the ~80% accuracy from binary crossentropy is just a bogus number? – Powder 7/2, 2017 at 3:45

I think so. If you use categorical labels i.e. one hot vectors, then you want categorical_crossentropy. If you have two classes, they will be represented as 0, 1 in binary labels and 10, 01 in categorical label format. – Slumber 7/2, 2017 at 3:54

Intuitively it makes sense why I'd want to use categorical_crossentropy, I don't understand why I get good results with binary, and poor results with categorical. – Powder 7/2, 2017 at 4:2

I think he just compares to the first number in the vector and ignores the rest. – Phan 7/2, 2017 at 7:11

I am observing a similar situation, If I use binary_crossentropy I get better results (also in terms of loss), very interesting. – Ketene 16/3, 2017 at 13:32

My data is imbalanced (one of the classes is more dense), do you also have similar structure in training data? – Ketene 16/3, 2017 at 13:35

I did, it's possible this was a contributing factor although I have since moved away from a neural net for this data (for other reasons) so I haven't looked into this much more – Powder 16/3, 2017 at 14:49

@ParagS.Chandakkar The representation will be 0, 1 for binary classification and [[0, 0], [0, 1]] for a categorical classification. It also highly depends on how you design the final softmax layer. Dense(1, activation='softmax') should allow for 0,1. Dense(2, activation='softmax') requires [[0,0],[0,1]] – Corvine 14/12, 2018 at 13:7

@NilavBaranGhosh The representation will be [[1, 0], [0, 1]] for a categorical classification involving two classes (not [[0, 0], [0, 1]] like you mention). Dense(1, activation='softmax') for binary classification is simply wrong. Remember softmax output is a probability distribution that sums to one. If you want to have only one output neuron with binary classification, use sigmoid with binary cross-entropy. – Slumber 14/12, 2018 at 19:0

268

The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:

the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels

I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.

This behavior is not a bug; the underlying reason is a rather subtle & undocumented issue with how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation. In other words, while your first compilation option

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

is valid, your second one:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).

Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.

Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # WRONG way

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.9975801164627075

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001

score[1]==acc
# False

To remedy this, i.e. to indeed use binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:

from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])

In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.98580000000000001

# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001

score[1]==acc
# True

System setup:

Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4

UPDATE: After my post, I discovered that this issue had already been identified in this answer.

Taker answered 4/9, 2017 at 13:34 Comment(0)

It all depends on the type of classification problem you are dealing with. There are three main categories

binary classification (two target classes),
multi-class classification (more than two exclusive targets),
multi-label classification (more than two non exclusive targets), in which multiple target classes can be on at the same time.

In the first case, binary cross-entropy should be used and targets should be encoded as one-hot vectors.

In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.

In the last case, binary cross-entropy should be used and targets should be encoded as one-hot vectors. Each output neuron (or unit) is considered as a separate random binary variable, and the loss for the entire vector of outputs is the product of the loss of single binary variables. Therefore it is the product of binary cross-entropy for each single output unit.

The binary cross-entropy is defined as

and categorical cross-entropy is defined as

where c is the index running over the number of classes C.
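For reference, the standard textbook forms of the two losses can be sketched in NumPy as below. This is an assumption about what is meant here, not a reproduction of the answer's own formulas; the function names mirror the Keras loss names, but the code is a hand-written illustration and the example vectors are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample binary CE: each output unit is its own Bernoulli variable,
    and the per-unit losses are averaged over the units."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_unit = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_unit.mean(axis=-1)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample categorical CE: sum over the C classes of -y_c * log(p_c).
    With a one-hot target only the true class contributes."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -(y_true * np.log(y_pred)).sum(axis=-1)

# Hypothetical single sample with C = 3 classes.
y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.2, 0.7, 0.1]])

bce = binary_crossentropy(y_true, y_pred)       # mean of -ln(0.8), -ln(0.7), -ln(0.9)
cce = categorical_crossentropy(y_true, y_pred)  # -ln(0.7)
print(bce, cce)
```

Note that the binary version also charges the model for the probability it puts on the wrong classes, while the categorical version only looks at the true class.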

Sidonnie answered 8/3, 2018 at 14:34 Comment(10)

Are you sure that the binary and categorical cross-entropies are defined as in the formulas in this answer? – Quinque 6/1, 2020 at 13:39

@nbro, actually, the c index is redundant in the binary cross-entropy formula, it doesn't need to be there (since there are only 2 classes and the probability of each class is embedded in y(x)). Otherwise those formulas should be correct, but notice those are not losses, those are likelihoods. If you want the loss you have to take the log of these. – Sidonnie 7/1, 2020 at 13:57

@Sidonnie You should explain why the formula for the categorical cross-entropy apparently looks simpler than the formula for the binary cross-entropy. You should also explain what C, c and all other symbols there are. (Yes, I am familiar with the log-trick). Furthermore, in all cases, you say that the targets should be one-hot encoded, but you say it for each case, rather than saying "for all cases, the targets need to be hot-encoded". Maybe you should spend words explaining your explanation. – Quinque 7/1, 2020 at 14:1

@Quinque Why should I explain why one formula looks simpler than the other? How does knowing this help one's understanding of the answer? Why would it be a problem that I repeat that the target should be one-hot encoded? This is not a review of an article or a paper. I'm not sure why you care about wording. As long as the explanation makes sense. I will explain the C and c – Sidonnie 8/1, 2020 at 11:37

Since you decided to give a general tutorial-type answer on the relevant ML notions instead of addressing the specific coding question as asked, it would be arguably useful to point out that, in the binary case, there is the option not to one-hot encode but to keep the labels as single-digits and use sigmoid activation in the last layer. Just repeating the one-hot encoding requirement in each single bullet is indeed redundant and not good practice. – Taker 18/9, 2020 at 7:58

I chose this path because I felt like the question didn't give enough detail as to which of those 3 problems it is attempting to solve. Yes you're right, I forgot to mention the activation function for all of these cases, but as you point out, for the binary case one should use the sigmoid function. But that's because a softmax function on two output units is mathematically equivalent to the sigmoid's output. – Sidonnie 22/9, 2020 at 13:23

What should keep me from using categorical cross-entropy in binary classification? After all, binary classification is just a special case of multi-class classification, so that should work as well, right? – Vector 20/11, 2020 at 9:14

Yes you're right, and the formula for binary ce is in fact a special case of the categorical ce, because if you only have two classes, given that the probabilities sum up to 1, the probability of one class is exactly 1 - the probability of the other one. Categorical ce is used when you have multiple output neurons, so the (1-y) is 'embedded' in the y(x) of the other neurons, but in binary classification people usually use only 1 output neuron, in which case you need to have the (1-y) term specifically in your loss. Does that make sense? You can find more detailed articles about this – Sidonnie 27/11, 2020 at 16:18

Nicely explained. However in case of multi-label classification, the final loss is the SUM (or average) of each of the single binary-CE losses...not their product. – Overabundance 9/3, 2021 at 12:22

You said all three cases should be encoded as "one-hot vector". Is it right? – Augusto 11/7, 2022 at 23:26

I came across an "inverted" issue: I was getting good results with categorical_crossentropy (with 2 classes) and poor results with binary_crossentropy. It seems the problem was a wrong activation function. The correct settings were:

for binary_crossentropy: sigmoid activation, scalar target
for categorical_crossentropy: softmax activation, one-hot encoded target
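To see why these two settings are interchangeable for 2 classes, here is a small NumPy sketch (not Keras code; the logits and labels are made-up values). It checks that a two-unit softmax with one-hot targets and a single sigmoid unit with scalar targets give the same per-sample loss, provided the sigmoid logit is taken as the difference of the two softmax logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits of a two-unit output layer, one row per sample.
logits = np.array([[0.3, 1.5],
                   [2.0, -1.0]])
labels = np.array([1, 0])      # scalar targets (class index 0 or 1)
one_hot = np.eye(2)[labels]    # the same targets, one-hot encoded

# Setting 1: softmax activation + categorical cross-entropy on one-hot targets.
p_softmax = softmax(logits)
cce = -(one_hot * np.log(p_softmax)).sum(axis=-1)

# Setting 2: one sigmoid unit + binary cross-entropy on scalar targets.
# The equivalent single logit is z1 - z0.
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
bce = -(labels * np.log(p_sigmoid) + (1 - labels) * np.log(1 - p_sigmoid))

print(cce, bce)
```

Mixing the two (for example a sigmoid output with one-hot targets of width 2, or a softmax over a single unit) breaks this correspondence, which is consistent with the poor results described above.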

Reluctivity answered 1/8, 2017 at 10:43 Comment(7)

Are you sure about scalar target for binary_crossentropy? It looks like you should use "many-hot" encoded target (e.g. [0 1 0 0 1 1]). – Hadley 15/9, 2017 at 2:19

Sure. See keras.io/losses/#usage-of-loss-functions, it says: "when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample)" – Reluctivity 15/9, 2017 at 10:56

But we are speaking about binary_crossentropy - not categorical_crossentropy. – Hadley 15/9, 2017 at 21:51

This answer seems to be inconsistent with https://mcmap.net/q/125297/-why-binary_crossentropy-and-categorical_crossentropy-give-different-performances-for-the-same-problem, where the author says that the targets should be one-hot encoded, while, in your answer, you suggest they should be scalars. You should clarify this. – Quinque 6/1, 2020 at 13:43

@AlexanderSvetkin, the target should be one-hot encoded everywhere, not just when using categorical cross-entropy – Sidonnie 7/1, 2020 at 13:54

@Sidonnie Not necessarily. See https://mcmap.net/q/128248/-what-is-the-difference-between-a-sigmoid-followed-by-the-cross-entropy-and-sigmoid_cross_entropy_with_logits-in-tensorflow. – Quinque 19/1, 2020 at 15:35

@Quinque Sorry I meant binary (0 or 1) and not any scalar as mentioned in this answer – Sidonnie 22/1, 2020 at 11:36

It's a really interesting case. Actually, in your setup the following statement is true:

binary_crossentropy = len(class_id_index) * categorical_crossentropy

This means that, up to a constant multiplicative factor, your losses are equivalent. The weird behaviour that you are observing during training might be an example of the following phenomenon:

At the beginning, the most frequent class dominates the loss, so the network learns to predict mostly this class for every example.
After it has learnt the most frequent pattern, it starts discriminating among the less frequent classes. But when you are using adam, the learning rate by then has a much smaller value than it had at the beginning of training (because of the nature of this optimizer). This makes training slower and makes it less likely that your network will, e.g., leave a poor local minimum.

That's why this constant factor might help in the case of binary_crossentropy: after many epochs, the learning rate value is greater than in the categorical_crossentropy case. When I notice such behaviour, I usually restart training (and the learning phase) a few times and/or adjust the class weights using the following pattern:

class_weight = 1 / class_frequency

This balances the loss from the less frequent classes against the influence of the dominant class, both at the beginning of training and later in the optimization process.
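As a rough sketch of that pattern, the weights can be computed from integer class labels as below. The label array `y_int` is a made-up example, and passing the resulting dictionary via the `class_weight` argument of `model.fit` is shown only as a comment, since no model is defined here.

```python
import numpy as np

# Hypothetical integer labels for an imbalanced 3-class problem.
y_int = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

counts = np.bincount(y_int)              # samples per class: [6, 3, 1]
class_frequency = counts / counts.sum()  # [0.6, 0.3, 0.1]

# class_weight = 1 / class_frequency, as a {class_index: weight} dictionary.
# A class that never occurs would give a division by zero, so this sketch
# assumes every class appears at least once.
class_weight = {c: float(1.0 / f) for c, f in enumerate(class_frequency)}
print(class_weight)

# With a compiled Keras model this could then be used as:
# model.fit(x_train, y_train, class_weight=class_weight)
```

The rarest class gets the largest weight, so its examples contribute as much to the total loss as the dominant class does.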

EDIT:

Actually, I checked that even though mathematically

binary_crossentropy = len(class_id_index) * categorical_crossentropy

should hold, in Keras it does not, because Keras automatically normalizes all outputs to sum up to 1. This is the actual reason behind this weird behaviour, as in the multi-class case such normalization harms training.

Mckenzie answered 7/2, 2017 at 19:59 Comment(2)

This is a very plausible explanation. But I'm not sure it's really the main reason. Because I've also observed in several of my students work this weird behavior when applying binary-X-ent instead of cat-X-ent (which is a mistake). And this is true even when training for only 2 epochs ! Using class_weight with inverse class priors did not help. May be a rigorous tuning of the learning rate would help, but the default values seem to favour bin-X-ent. I think this question deserves more investigations... – Singularity 9/6, 2017 at 13:26

Wait, no sorry, I don't get your update: the softmax always make the outputs sum to 1, so we don't care about that ? And why would this harm training, as long as we only have a single gold class that is correct per example ? – Singularity 10/6, 2017 at 21:19

After commenting on @Marcin's answer, I checked more carefully one of my students' code, where I found the same weird behavior, even after only 2 epochs! (So @Marcin's explanation was not very likely in my case.)

And I found that the answer is actually very simple: the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels. You can check that by recomputing the accuracy yourself (first call the Keras method "predict" and then compute the number of correct answers returned by predict): you get the true accuracy, which is much lower than the Keras "evaluate" one.

Singularity answered 12/6, 2017 at 12:2 Comment(1)

I saw similar behavior on the first iteration as well. – Cutright 6/8, 2017 at 10:48

A simple example under a multi-class setting to illustrate.

Suppose you have 4 classes (one-hot encoded) and below is just one prediction:

true_label = [0,1,0,0]
predicted_label = [0,0,1,0]

When using categorical_crossentropy, the reported accuracy is just 0: it only cares about whether you get the concerned class right.

However, when using binary_crossentropy, the accuracy is calculated over all classes, so it would be 50% for this prediction. In both cases, the final result is the mean of these per-prediction accuracies.

It is recommended to use categorical_crossentropy for multi-class problems (where classes are mutually exclusive) and binary_crossentropy for multi-label problems.
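The two numbers for this example can be reproduced with a few lines of NumPy. The two helper functions are hand-written stand-ins for the idea of a "categorical" and a "binary" accuracy (the 0.5 threshold is an assumption), not the Keras implementations themselves.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring class is the true class."""
    return np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1))

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual outputs that land on the right side of the threshold."""
    return np.mean((y_pred > threshold).astype(int) == y_true)

true_label = np.array([[0, 1, 0, 0]])
predicted_label = np.array([[0, 0, 1, 0]])

cat_acc = categorical_accuracy(true_label, predicted_label)
bin_acc = binary_accuracy(true_label, predicted_label)
print(cat_acc, bin_acc)  # 0.0 0.5
```

The prediction picks the wrong class, yet two of its four outputs (the first and the last) still match the target, which is where the 50% comes from.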

Inutility answered 29/12, 2018 at 9:13 Comment(0)

As it is a multi-class problem, you have to use categorical_crossentropy; binary cross-entropy will produce bogus results and most likely will evaluate only the first two classes.

50% for a multi-class problem can be quite good, depending on the number of classes. If you have n balanced classes, then 100/n percent is the accuracy you would expect from outputting a random class.

Beaverboard answered 7/2, 2017 at 15:4 Comment(0)

You are passing a target array of shape (x-dim, y-dim) while using categorical_crossentropy as the loss. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils import to_categorical
y_binary = to_categorical(y_int)

Alternatively, you can use the loss function sparse_categorical_crossentropy instead, which does expect integer targets.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Jim answered 16/4, 2019 at 12:38 Comment(0)

The main point is answered satisfactorily by desernaut's brilliant piece of sleuthing. However, there are occasions when BCE (binary cross-entropy) could give different results than CCE (categorical cross-entropy) and may be the preferred choice. While the rules of thumb shared above (which loss to select) work fine in 99% of cases, I would like to add a few new dimensions to this discussion.

The OP had a softmax activation, which produces a probability distribution as the predicted value. It is a multi-class problem, so the preferred loss is categorical CE. Essentially this boils down to -ln(p), where 'p' is the predicted probability of the lone positive class in the sample. This means that the negative predictions don't play a role in calculating CE. This is intentional.

On rare occasions, it may be necessary to make the -ve voices count. This can be done by treating the above sample as a series of binary predictions. So if expected is [1 0 0 0 0] and predicted is [0.1 0.5 0.1 0.1 0.2], this is further broken down into:

expected = [1,0], [0,1], [0,1], [0,1], [0,1]
predicted = [0.1, 0.9], [.5, .5], [.1, .9], [.1, .9], [.2, .8]

Now we compute 5 different cross-entropies, one for each of the above 5 expected/predicted pairs, and sum them up:

CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.8)]

The CE has a different scale but continues to be a measure of the difference between the expected and predicted values. The only difference is that in this scheme, the -ve values are also penalized/rewarded along with the +ve values. If your problem is such that you are going to use the output probabilities (both +ve and -ve) instead of using max() to predict just the one +ve label, then you may want to consider this version of CE.
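The two versions can be compared numerically for the sample above. This is a plain NumPy sketch of the arithmetic, with variable names chosen only for illustration:

```python
import numpy as np

expected = np.array([1, 0, 0, 0, 0])
predicted = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

# Categorical CE: only the probability of the single positive class matters.
categorical_ce = -np.sum(expected * np.log(predicted))  # -ln(0.1)

# "Series of binary predictions": one 2-way cross-entropy per class, summed.
# Positive class contributes -ln(p); each negative class contributes -ln(1 - p).
per_class = -(expected * np.log(predicted)
              + (1 - expected) * np.log(1 - predicted))
summed_binary_ce = per_class.sum()  # -[ln(.1) + ln(.5) + ln(.9) + ln(.9) + ln(.8)]

print(round(categorical_ce, 3), round(summed_binary_ce, 3))  # 2.303 3.43
```

The difference between the two numbers is exactly the contribution of the four negative classes.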

How about a multi-label situation where expected = [1 0 0 0 1]? The conventional approach is to use one sigmoid per output neuron instead of an overall softmax. This ensures that the output probabilities are independent of each other. So we get something like:

expected = [1 0 0 0 1]
predicted = [0.1 0.5 0.1 0.1 0.9]

By definition, CE measures the difference between 2 probability distributions. But the above two lists are not probability distributions, since a probability distribution should always add up to 1. So the conventional solution is to use the same loss approach as before: break the expected and predicted values into 5 individual probability distributions, calculate 5 cross-entropies and sum them up. Then:

CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.9)] = 3.3

The challenge arises when the number of classes is very high, say 1000, and only a couple of them are present in each sample. So the expected is something like: [1,0,0,0,0,0,1,0,0,0.....990 zeroes]. The predicted could be something like: [.8, .1, .1, .1, .1, .1, .8, .1, .1, .1.....990 0.1's]

In this case the CE =

- [ ln(.8) + ln(.8) for the 2 +ve classes and 998 * ln(0.9) for the 998 -ve classes]

= 0.44 (for the +ve classes) +  105 (for the negative classes)

You can see how the -ve classes are beginning to create a nuisance value when calculating the loss. The voice of the +ve samples (which may be all that we care about) is getting drowned out. What do we do? We can't use categorical CE (the version where only +ve samples are considered in the calculation). This is because we are forced to break up the output into multiple binary probability distributions, since otherwise it would not be a probability distribution in the first place. Once we break it into multiple binary probability distributions, we have no choice but to use binary CE, and this of course gives weight to the -ve classes.

One option is to drown out the voice of the -ve classes with a multiplier. So we multiply all -ve losses by a value gamma where gamma < 1. Say in the above case gamma is .0001. Now the loss comes to:

= 0.44 (for the +ve classes) + 0.0105 (for the negative classes)
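The arithmetic of the 1000-class example can be checked with a short NumPy sketch. The positions of the two positive classes and the name `gamma` follow the text above; everything else is just illustration.

```python
import numpy as np

n_classes = 1000
expected = np.zeros(n_classes)
expected[[0, 6]] = 1              # two positive classes, 998 negatives
predicted = np.full(n_classes, 0.1)
predicted[[0, 6]] = 0.8

# Binary CE split into the part coming from positives and from negatives.
pos_loss = -np.log(predicted[expected == 1]).sum()      # 2 * -ln(0.8), about 0.446
neg_loss = -np.log(1 - predicted[expected == 0]).sum()  # 998 * -ln(0.9), about 105.15

gamma = 1e-4                      # multiplier applied to the negative losses only
weighted_total = pos_loss + gamma * neg_loss            # negatives shrink to about 0.0105

print(pos_loss, neg_loss, weighted_total)
```

With the multiplier applied, the positive classes dominate the loss again; how small gamma should be is a tuning choice that depends on the ratio of negatives to positives.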

The nuisance value has come down. Two years back Facebook did that and much more in a paper where they also multiplied the -ve losses by p to the power of x, where 'p' is the predicted probability of the output being +ve and x is a constant > 1. This reduces the -ve losses even further, especially the ones where the model is pretty confident (where 1-p is close to 1). This combined effect of down-weighting negative class losses and down-weighting the easily classified cases even more strongly (which accounted for the majority of the -ve cases) worked beautifully for Facebook, and they called it focal loss.
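The effect of the p-to-the-power-of-x factor on a negative output can be illustrated as follows. This is only a sketch of the description above: the helper name and the value x = 2 are assumptions, and the published focal loss has further details that are not reproduced here.

```python
import numpy as np

def down_weighted_negative_loss(p, x=2.0):
    """Binary CE of a negative (target 0) output, scaled by p**x.

    p is the predicted probability of the output being positive;
    x is a hypothetical value for the exponent.
    """
    return -(p ** x) * np.log(1.0 - p)

p_easy, p_hard = 0.1, 0.9  # confident-and-right vs confident-and-wrong negative

plain_easy = -np.log(1.0 - p_easy)   # ordinary binary CE, about 0.105
plain_hard = -np.log(1.0 - p_hard)   # ordinary binary CE, about 2.303

focal_easy = down_weighted_negative_loss(p_easy)  # scaled by 0.1**2 = 0.01
focal_hard = down_weighted_negative_loss(p_hard)  # scaled by 0.9**2 = 0.81

print(focal_easy / plain_easy, focal_hard / plain_hard)
```

An easy negative keeps only 1% of its loss while a badly misclassified one keeps 81%, so the many easy negatives stop drowning out everything else.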

So in response to the OP's question of whether binary CE makes any sense at all in his case, the answer is: it depends. In 99% of cases the conventional rules of thumb work, but there could be occasions when these rules can be bent or even broken to suit the problem at hand.

For a more in-depth treatment, you can refer to: https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0

Overabundance answered 16/3, 2021 at 14:7 Comment(0)

when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).

Began answered 2/2, 2018 at 23:9 Comment(1)

How exactly this answers the question? – Taker 13/6, 2018 at 10:45

Take a look at the equations: you will find that binary cross-entropy punishes not only the cases where label = 1 and predicted = 0, but also those where label = 0 and predicted = 1.

Categorical cross-entropy, however, only punishes the cases where label = 1 but predicted = 0. That's why we assume that there is only ONE positive label.

Liatris answered 7/5, 2019 at 22:59 Comment(0)

-4

binary_crossentropy(y_target, y_predict) is not restricted to binary classification problems.

In the source code of binary_crossentropy(), TensorFlow's nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output) is what is actually used.

And, in the documentation, it says that:

Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.

Gifted answered 21/2, 2019 at 16:34 Comment(0)
