What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

In the tensorflow API docs they use a keyword called logits. What is it? A lot of methods are written like:

tf.nn.softmax(logits, name=None)

If logits is just a generic Tensor input, why is it named logits?


Secondly, what is the difference between the following two methods?

tf.nn.softmax(logits, name=None)
tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)

I know what tf.nn.softmax does, but not the other. An example would be really helpful.

Plaque answered 12/12, 2015 at 14:3 Comment(1)
see this: stats.stackexchange.com/questions/52825/…Spendable

The "with logits" part simply means that the function operates on the unscaled output of earlier layers and that the relative scale to understand the units is linear. It means, in particular, that the sum of the inputs may not equal 1 and that the values are not probabilities (you might have an input of 5). Internally, it first applies softmax to the unscaled output, and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.

tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that sum(output) = 1, and it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them back into raw probabilities between 0 and 1. The shape of the output of softmax is the same as that of the input:

import numpy as np
import tensorflow as tf

s = tf.Session()
a = tf.constant(np.array([[.1, .3, .5, .9]]))
print(s.run(tf.nn.softmax(a)))  # [[ 0.16838508  0.205666    0.25120102  0.37474789]]

See this answer for more about why softmax is used extensively in DNNs.

tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss, but it does it all together in a more mathematically careful way. It's similar to the result of:

sm = tf.nn.softmax(x)
ce = cross_entropy(sm, labels)  # pseudocode: cross_entropy is not an actual TF function

The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2], one loss value per example (the first dimension is treated as the batch).
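As a quick check of the shapes, here is a minimal sketch (TF 1.x style, as in the rest of this answer; note that newer TF releases require the labels=/logits= keyword arguments used here, while older releases also accepted positional arguments):

import numpy as np
import tensorflow as tf

logits = tf.constant(np.random.randn(2, 5))      # shape [2, 5]: batch of 2, 5 classes
labels = tf.constant([[0., 0., 1., 0., 0.],
                      [0., 1., 0., 0., 0.]], dtype=tf.float64)

losses = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
with tf.Session() as sess:
    print(sess.run(losses).shape)                # prints (2,): one loss per batch element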

If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.

Edited 2016-02-07: If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.
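For illustration, here is a hedged sketch of the dense vs. sparse call (the tensors are made up; the keyword-argument style matches newer TF releases):

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.3, 2.2]])          # unscaled scores, shape [2, 3]

# Dense variant: labels are one-hot vectors.
onehot_labels = tf.constant([[1.0, 0.0, 0.0],
                             [0.0, 0.0, 1.0]])
dense_loss = tf.nn.softmax_cross_entropy_with_logits(labels=onehot_labels, logits=logits)

# Sparse variant: labels are plain class indices, no one-hot conversion needed.
class_ids = tf.constant([0, 2])
sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=class_ids,
                                                             logits=logits)

with tf.Session() as sess:
    print(sess.run([dense_loss, sparse_loss]))   # the two losses match element-wise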

Kinnikinnick answered 12/12, 2015 at 19:1 Comment(9)
About the softmax_cross_entropy_with_logits, I don't know if I use it correctly. The result is not that stable in my code. The same code runs twice, the total accuracy changes from 0.6 to 0.8. cross_entropy = tf.nn.softmax_cross_entropy_with_logits(tf.nn.softmax(tf.add(tf.matmul(x,W),b)),y) cost=tf.reduce_mean(cross_entropy). But when I use another way, pred=tf.nn.softmax(tf.add(tf.matmul(x,W),b)) cost =tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred),reduction_indices=1)) the result is stable and better.Clachan
You're double-softmaxing in your first line. softmax_cross_entropy_with_logits expects unscaled logits, not the output of tf.nn.softmax. You just want tf.nn.softmax_cross_entropy_with_logits(tf.add(tf.matmul(x, W, b)) in your case.Kinnikinnick
@Kinnikinnick I think you have a typo in your code, the b needs to be outside of the bracket: tf.nn.softmax_cross_entropy_with_logits(tf.add(tf.matmul(x, W), b))Unsuccess
what does "that the relative scale to understand the units is linear." part of your first sentence mean?Spendable
Re: "because it covers numerically unstable corner cases" I'm wondering if it's true. By definition of Softmax, it's numerically stable even if all the values are zeros or negative.Malignancy
Upvoted-but your answer is slightly incorrect when you say that "[t]he shape of output of a softmax is the same as the input - it just normalizes the values". Softmax doesn't just "squash" the values so that their sum equals 1. It also redistributes them, and that's possibly the main reason why it's used. See #17188007, especially Piotr Czapla's answer.Ultimogeniture
Somebody asking this question should really learn first off that logits is short for the log-odds of events in logistic regressions.Giacopo
link: en.wikipedia.org/wiki/Logistic_regression#Logistic_model the statistical log-odds are precisely that: the log of the odds. passing them through softmax makes the 0-1 ranged odds. i don't think statistics and machine learning are separate here: the terms appear to mean the same thing. Here's a quote from the top of that page: "The unit of measurement for the log-odds scale is called a logit, from logistic unit"Giacopo
Thanks! I have, I hope, now incorporated these comments into the answer above.Kinnikinnick

Short version:

Suppose you have two tensors, where y_hat contains computed scores for each class (for example, from y = W*x + b) and y_true contains one-hot encoded true labels.

y_hat  = ... # Predicted label, e.g. y = tf.matmul(X, W) + b
y_true = ... # True label, one-hot encoded

If you interpret the scores in y_hat as unnormalized log probabilities, then they are logits.

Additionally, the total cross-entropy loss computed in this manner:

y_hat_softmax = tf.nn.softmax(y_hat)
total_loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), [1]))

is essentially equivalent to the total cross-entropy loss computed with the function softmax_cross_entropy_with_logits():

total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))

Long version:

In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.

import tensorflow as tf
import numpy as np

sess = tf.Session()

# Create example y_hat.
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],[2.2, 1.3, 1.7]]))
sess.run(y_hat)
# array([[ 0.5,  1.5,  0.1],
#        [ 2.2,  1.3,  1.7]])

Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.

y_hat_softmax = tf.nn.softmax(y_hat)
sess.run(y_hat_softmax)
# array([[ 0.227863  ,  0.61939586,  0.15274114],
#        [ 0.49674623,  0.20196195,  0.30129182]])

It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.

                      Pr(Class 1)  Pr(Class 2)  Pr(Class 3)
                    ,--------------------------------------
Training instance 1 | 0.227863   | 0.61939586 | 0.15274114
Training instance 2 | 0.49674623 | 0.20196195 | 0.30129182

So now we have class probabilities for each training instance, and we can take the argmax() of each row to generate a final classification. From the table above, we would predict that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
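For example, continuing with the sess and y_hat_softmax defined above (a small sketch; the printed values follow from the table above):

predictions = tf.argmax(y_hat_softmax, 1)
sess.run(predictions)
# array([1, 0])   # i.e. "Class 2" for instance 1 and "Class 1" for instance 2 (0-based column indices)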

Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".

y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]))
sess.run(y_true)
# array([[ 0.,  1.,  0.],
#        [ 0.,  0.,  1.]])

Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.

For each training instance i, the cross-entropy loss is:

    loss_i = -sum_j( y_true[i, j] * log(y_hat_softmax[i, j]) )

We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".

loss_per_instance_1 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])
sess.run(loss_per_instance_1)
# array([ 0.4790107 ,  1.19967598])

What we really want is the total loss over all the training instances. So we can compute:

total_loss_1 = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
sess.run(total_loss_1)
# 0.83934333897877944

Using softmax_cross_entropy_with_logits()

We can instead compute the total cross entropy loss using the tf.nn.softmax_cross_entropy_with_logits() function, as shown below.

loss_per_instance_2 = tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true)
sess.run(loss_per_instance_2)
# array([ 0.4790107 ,  1.19967598])

total_loss_2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
sess.run(total_loss_2)
# 0.83934333897877922

Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().
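To see why the fused version is preferable numerically, here is a small sketch continuing with the sess from above (the logit value 1000 is an extreme, made-up example; the keyword-argument call matches newer TF releases, and the exact printed output may vary by version). The hand-rolled version overflows, while the fused op returns the correct finite loss:

big_logits = tf.convert_to_tensor(np.array([[1000.0, 0.0]]))   # extreme, made-up scores
y_big      = tf.convert_to_tensor(np.array([[0.0, 1.0]]))      # true class is the second one

manual = -tf.reduce_sum(y_big * tf.log(tf.nn.softmax(big_logits)), reduction_indices=[1])
fused  = tf.nn.softmax_cross_entropy_with_logits(labels=y_big, logits=big_logits)

sess.run([manual, fused])
# [array([ inf]), array([ 1000.])]   # the hand-rolled loss blows up, the fused one does not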

Suchlike answered 14/9, 2016 at 20:54 Comment(3)
I confirm all of the above. The simple code: M = tf.random.uniform([100, 10], minval=-1.0, maxval=1.0); labels = tf.one_hot(tf.random.uniform([100], minval=0, maxval=10 , dtype='int32'), 10); tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=M) - tf.reduce_sum(-tf.nn.log_softmax(M)*tf.one_hot(labels, 10), -1) returns close-to-zero everywhereLorenzo
Sorry for simple/dummy question. I didn't understand getting loss '0.479' from training instance-1. True label for instance-1 is '2'. If I apply -1xlog2(0.619) I get 0.691.Protocol
Edit: Loss is calculated using log 'e' base, okay.Protocol

tf.nn.softmax computes the forward propagation through a softmax layer. You use it during evaluation of the model when you compute the probabilities that the model outputs.

tf.nn.softmax_cross_entropy_with_logits computes the cost for a softmax layer. It is only used during training.

The logits are the unnormalized log probabilities output by the model (the values output before the softmax normalization is applied to them).
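As a rough sketch of how the two are typically split between training and evaluation (the shapes and variable names here are made up for illustration; the keyword-argument call matches newer TF 1.x releases):

import tensorflow as tf

x      = tf.placeholder(tf.float32, [None, 784])   # made-up input size
y_true = tf.placeholder(tf.float32, [None, 10])    # one-hot labels
W      = tf.Variable(tf.zeros([784, 10]))
b      = tf.Variable(tf.zeros([10]))

logits = tf.matmul(x, W) + b                        # unnormalized scores

# Training: feed the raw logits to the fused loss.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Evaluation: apply softmax only when you actually want probabilities.
probabilities = tf.nn.softmax(logits)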

Silent answered 14/12, 2015 at 16:47 Comment(7)
I get it. Why not call the function, tf.nn.softmax_cross_entropy_sans_normalization?Interrogative
@Interrogative because it normalizes the values (internally) during the cross-entropy computation. The point of tf.nn.softmax_cross_entropy_with_logits is to evaluate how much the model deviates from the gold labels, not to provide a normalized output.Zoraidazorana
Given that tf.nn.sparse_softmax_cross_entropy_with_logits() computes the cost of a sparse softmax layer, and thus should only be used during training, what would be the alternative when running the model against new data? Is it possible to obtain probabilities from this one?Anchie
@SerialDev, it's not possible to get probabilities from tf.nn.sparse_softmax_cross_entropy_with_logits. To get probabilities use tf.nn.softmax.Biradial
They're not log probabilities but log odds.Miliary
Link to log odds page on wikipedia now links to a function rather than a probability scale. en.wikipedia.org/wiki/Logistic_regression mentions "logit, from logistic unit".Giacopo
But why is the from_logits parameter of loss functions like BinaryCrossentropy and SparseCategoricalCrossentropy set to False by default? Does it mean that we always use probabilities when we use model.compile(loss="sparse_categorical_crossentropy", ...)?Ezequiel

Mathematical motivation for term

When we wish to constrain an output between 0 and 1, but our model architecture outputs unconstrained values, we can add a normalisation layer to enforce this.

A common choice is a sigmoid function [1]. In binary classification this is typically the logistic function, and in multi-class tasks the multinomial logistic function (a.k.a. softmax) [2].

If we want to interpret the outputs of our new final layer as 'probabilities', then (by implication) the unconstrained inputs to our sigmoid must be inverse-sigmoid(probabilities). In the logistic case this is equivalent to the log-odds of our probability (i.e. the log of the odds) a.k.a. logit:

That is why the argument to softmax is called logits in Tensorflow: because under the assumption that softmax is the final layer in the model, and the output p is interpreted as a probability, the input x to this layer is interpretable as a logit:

    p = sigmoid(x) = 1 / (1 + exp(-x))        <=>        x = logit(p) = log(p / (1 - p))
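A quick numeric check of this inverse relationship (a sketch in plain Python/NumPy, with made-up numbers):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

x = 2.0                            # an unconstrained "score"
p = sigmoid(x)                     # ~0.8808, a valid probability
print(np.isclose(logit(p), x))     # True: logit is the inverse of the logistic sigmoid

# In the two-class case, softmax over [x, 0] reduces to the logistic function of x.
scores = np.array([x, 0.0])
softmax = np.exp(scores) / np.exp(scores).sum()
print(np.isclose(softmax[0], p))   # True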

Generalised term

In Machine Learning there is a propensity to generalise terminology borrowed from maths/stats/computer science, hence in Tensorflow logit (by analogy) is used as a synonym for the input to many normalisation functions.


  [1] While it has nice properties such as being easily differentiable, and the aforementioned probabilistic interpretation, it is somewhat arbitrary.
  [2] softmax might be more accurately called softargmax, as it is a smooth approximation of the argmax function.
Miliary answered 25/3, 2021 at 16:53 Comment(0)

The answers above have enough description for the asked question.

Adding to that, Tensorflow has optimised the fused operation of applying the activation function and then calculating the cost, as opposed to running its own activation followed by a separate cost function. Hence it is good practice to use tf.nn.softmax_cross_entropy_with_logits() over tf.nn.softmax() followed by a separate cross-entropy computation.

You can find a noticeable difference between them in a resource-intensive model.

Condense answered 19/7, 2017 at 7:25 Comment(2)
The answers above clearly haven't read the question. They all say the same things, which are known, but don't answer the question itselfIronworks
@abhish Did you mean, tf.nn.softmax followed by tf.losses.softmax_cross_entropy?Kilderkin

Tensorflow 2.0 Compatible Answer: The explanations of dga and stackoverflowuser2010 are very detailed about logits and the related functions.

All those functions, when used in Tensorflow 1.x, will work fine, but if you migrate your code from 1.x (1.14, 1.15, etc.) to 2.x (2.0, 2.1, etc.), using those functions results in errors.

Hence, for the benefit of the community, here are the 2.0-compatible calls for all the functions discussed above, for use when migrating from 1.x to 2.x.

Functions in 1.x:

  1. tf.nn.softmax
  2. tf.nn.softmax_cross_entropy_with_logits
  3. tf.nn.sparse_softmax_cross_entropy_with_logits

Respective Functions when Migrated from 1.x to 2.x:

  1. tf.compat.v2.nn.softmax
  2. tf.compat.v2.nn.softmax_cross_entropy_with_logits
  3. tf.compat.v2.nn.sparse_softmax_cross_entropy_with_logits

For more information about migration from 1.x to 2.x, please refer to this Migration Guide.
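For example, here is a hedged sketch of how the same computation looks with the 2.x API (eager execution, no sessions; the tensors are made up for illustration). Note that tf.nn.softmax and tf.nn.softmax_cross_entropy_with_logits are also still available directly under tf.nn in 2.x, with keyword arguments labels= and logits=:

import tensorflow as tf  # 2.x

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.constant([[1.0, 0.0, 0.0]])

probs = tf.nn.softmax(logits)                                          # eager tensor of probabilities
loss  = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

print(probs.numpy())   # approximately [[0.659 0.242 0.099]]
print(loss.numpy())    # approximately [0.417]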

Lowboy answered 17/2, 2020 at 12:55 Comment(0)

Logits are the unnormalized outputs of a neural network. Softmax is a normalization function that squashes the outputs of a neural network so that they are all between 0 and 1 and sum to 1. Softmax_cross_entropy_with_logits is a loss function that takes in the raw outputs of a neural network (before they have been squashed by softmax) together with the true labels for those outputs, and returns a loss value.

Matador answered 14/7, 2022 at 23:3 Comment(0)

One more thing that I would like to highlight: a logit is just a raw output, generally the output of the last layer. It can be a negative value as well. If we use it as-is for the "cross entropy" evaluation as mentioned below:

-tf.reduce_sum(y_true * tf.log(logits))

then it won't work, as the log of a negative value is not defined. Using the softmax activation first overcomes this problem.
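A quick sketch of the problem (TF 1.x style, with made-up logits): taking the log of raw logits produces nan as soon as a logit is negative, while going through softmax first (or using the fused op) keeps the loss well-defined:

import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])   # raw scores; note the negative value
y_true = tf.constant([[0.0, 1.0, 0.0]])

bad  = -tf.reduce_sum(y_true * tf.log(logits))                  # log(-1.0) -> nan
good = -tf.reduce_sum(y_true * tf.log(tf.nn.softmax(logits)))   # well-defined

with tf.Session() as sess:
    print(sess.run([bad, good]))   # approximately [nan, 3.24]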

Harper answered 13/5, 2020 at 12:50 Comment(0)
