What are the differences between all these cross-entropy losses in Keras and TensorFlow?
What are the differences between all these cross-entropy losses?

Keras talks about

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy

While TensorFlow has

  • Softmax cross-entropy with logits
  • Sparse softmax cross-entropy with logits
  • Sigmoid cross-entropy with logits

What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?

Alduino answered 21/6, 2017 at 11:29
See the 3 simple rules of thumb at #47035388 and you should be able to quickly navigate the loss functions of any framework. – Responsive

There is just one (Shannon) cross-entropy, defined as:

H(P||Q) = - SUM_i P(X=i) log Q(X=i)

In machine learning usage, P is the actual (ground truth) distribution and Q is the predicted distribution. All the functions you listed are just helper functions that accept different ways of representing P and Q.

There are basically 3 main things to consider:

  • there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per Q(X=...))

  • one either produces proper probabilities (meaning that Q(X=i) >= 0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed into a probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.

  • either there is a j such that P(X=j)=1 (there is one "true class"; targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but with 40% probability it is actually a dog").
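To make the definition concrete, here is a minimal pure-Python sketch of the formula above, evaluated once with a hard target and once with a soft target:

```python
import math

def cross_entropy(p, q):
    # H(P||Q) = -SUM_i P(X=i) * log Q(X=i); terms with P(X=i)=0 contribute nothing
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.7, 0.3]                       # predicted distribution over {cat, dog}
hard = cross_entropy([1.0, 0.0], q)  # hard target: "this is a cat" -> -log(0.7)
soft = cross_entropy([0.6, 0.4], q)  # soft target: "60% cat, 40% dog"
print(hard, soft)
```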

Depending on these three aspects, a different helper function should be used:

                                  outcomes     what is in Q    targets in P   
-------------------------------------------------------------------------------
binary CE                                2      probability         any
categorical CE                          >2      probability         soft
sparse categorical CE                   >2      probability         hard
sigmoid CE with logits                   2      score               any
softmax CE with logits                  >2      score               soft
sparse softmax CE with logits           >2      score               hard

In the end one could just use "categorical cross-entropy", as this is how it is mathematically defined; however, since things like hard targets and binary classification are very popular, modern ML libraries provide these additional helper functions to make things simpler. In particular, "stacking" a sigmoid (or softmax) and the cross-entropy can be numerically unstable, but if one knows the two operations are applied together, there is a numerically stable version of the combination (which is what TF implements).
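To illustrate why the combined "with logits" form exists, here is an illustrative pure-Python sketch (not TF's actual implementation): the naive softmax-then-log path overflows for large scores, while the combined log-sum-exp form stays finite:

```python
import math

def naive_softmax_ce(logits, label):
    # softmax first, then log: math.exp overflows for large logits
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[label])

def stable_softmax_ce(logits, label):
    # combined form: log SUM_i exp(z_i) - z_label, with the max subtracted out
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[label]

print(stable_softmax_ce([1000.0, 0.0], 0))   # finite, ~0.0
# naive_softmax_ce([1000.0, 0.0], 0) would raise OverflowError
```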

It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to binary classification with one output, your network will be considered to always produce "True" at the output.
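This single-output failure mode is easy to reproduce in a small sketch: softmax over a length-1 vector is always [1.0], regardless of the score.

```python
import math

def softmax(zs):
    # standard softmax with the max subtracted for stability
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# With one output unit the softmax collapses to a constant,
# so a softmax-based loss always sees "class 0 with probability 1".
print(softmax([-5.0]), softmax([37.2]))  # [1.0] [1.0]
```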

As a final note, this answer considers classification; things are slightly different in the multi-label case (when a single point can have multiple labels), since then the values of P do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
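A minimal sketch of the multi-label case, applying an independent per-unit sigmoid cross-entropy (an illustrative re-implementation, not TF's sigmoid_cross_entropy_with_logits itself):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_sigmoid_ce(logits, targets):
    # one independent binary cross-entropy per output unit;
    # the targets need not sum to 1
    total = 0.0
    for z, t in zip(logits, targets):
        q = sigmoid(z)
        total += -(t * math.log(q) + (1 - t) * math.log(1 - q))
    return total

# the image contains a cat (unit 0) AND a dog (unit 1), but no bird (unit 2)
loss = multilabel_sigmoid_ce([2.0, 1.5, -3.0], [1.0, 1.0, 0.0])
print(loss)
```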

Wanton answered 21/6, 2017 at 18:57
Cool! It would be great if we knew which ones the Keras loss functions represent. – Hydromagnetics
I used a naming convention compatible with Keras and TF, so Keras's "binary cross entropy" is "binary CE" from my table, and so on. – Wanton

Logits

For this purpose, "logits" can be seen as the non-activated outputs of the model.

  • Keras losses take an "activated" output (you must apply "sigmoid" or "softmax" before the loss)
  • TensorFlow losses take "logits", i.e. "non-activated" outputs (you should not apply "sigmoid" or "softmax" before the loss)

Losses "with logits" apply the activation internally. Some functions let you choose from_logits=True or from_logits=False, which tells the function whether to "apply" or "not apply" the activation.
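The two conventions compute the same number; here is a pure-Python sketch (not the actual Keras/TF implementations), using binary cross-entropy and the usual numerically stable rewrite of the "with logits" form:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(q, t):
    # Keras-style: expects an already-activated output q in (0, 1)
    return -(t * math.log(q) + (1 - t) * math.log(1 - q))

def bce_from_logits(z, t):
    # "with logits": the sigmoid is folded into the loss, in the stable form
    # max(z, 0) - z*t + log(1 + exp(-|z|))
    return max(z, 0.0) - z * t + math.log(1.0 + math.exp(-abs(z)))

z, t = 1.3, 1.0
print(bce_from_probability(sigmoid(z), t), bce_from_logits(z, t))  # same value, up to float rounding
```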


Sparse

  • Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4.....
  • Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1]
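In a sketch, the two label formats feed the same formula:

```python
import math

def categorical_ce(one_hot, q):
    # non-sparse: the target is a one-hot (or soft) probability vector
    return -sum(t * math.log(qi) for t, qi in zip(one_hot, q) if t > 0)

def sparse_categorical_ce(label, q):
    # sparse: the target is just the integer class index
    return -math.log(q[label])

q = [0.2, 0.7, 0.1]
print(categorical_ce([0.0, 1.0, 0.0], q), sparse_categorical_ce(1, q))
```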

Binary crossentropy = Sigmoid crossentropy

  • Problem type:
    • single class (false/true); or
    • non-exclusive multiclass (many classes may be correct)
  • Model output shape: (batch, ..., >=1)
  • Activation: "sigmoid"

Categorical crossentropy = Softmax crossentropy

  • Problem type: exclusive classes (only one class may be correct)
  • Model output shape: (batch, ..., >=2)
  • Activation: "softmax"
Highflown answered 30/1, 2020 at 14:41
Thanks for the great response. However, why does Keras name it CategoricalCrossentropy while TensorFlow names it softmax cross-entropy? Shouldn't they follow the same naming convention for the same loss? – Card
TensorFlow applies the softmax for you; Keras doesn't. – Hydromagnetics
According to the documentation, Keras's CategoricalCrossentropy also applies the softmax when configured with from_logits=True. The link is keras.io/api/losses/probabilistic_losses/… – Card
