How does binary cross entropy loss work on autoencoders?

I wrote a vanilla autoencoder using only Dense layers. Below is my code:

from keras.datasets import mnist
from keras.layers import Input, Dense
from keras.models import Model

iLayer = Input((784,))
layer1 = Dense(128, activation='relu')(iLayer)
layer2 = Dense(64, activation='relu')(layer1)
layer3 = Dense(28, activation='relu')(layer2)
layer4 = Dense(64, activation='relu')(layer3)
layer5 = Dense(128, activation='relu')(layer4)
layer6 = Dense(784, activation='softmax')(layer5)
model = Model(iLayer, layer6)
model.compile(loss='binary_crossentropy', optimizer='adam')

(trainX, trainY), (testX, testY) = mnist.load_data()
print("shape of the trainX", trainX.shape)
trainX = trainX.reshape(trainX.shape[0], trainX.shape[1] * trainX.shape[2])
print("shape of the trainX", trainX.shape)
model.fit(trainX, trainX, epochs=5, batch_size=100)

Questions:

1) Softmax provides a probability distribution. Understood. This means I would have a vector of 784 values, each a probability between 0 and 1, for example [0.02, 0.03, ... up to 784 items], and summing all 784 elements gives 1 (a quick check of this is sketched below).

2) I don't understand how binary cross-entropy works with these values. Binary cross-entropy is for two output values, right?
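To make point 1 concrete, here is a small NumPy check I have in mind (just an illustration, not part of my actual model code):

import numpy as np

# softmax over 784 arbitrary logits
logits = np.random.randn(784)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.min(), probs.max())  # every value lies in (0, 1)
print(probs.sum())               # the 784 values sum to 1.0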

Aleida answered 21/9, 2018 at 10:35 Comment(6)
In such contexts (autoencoders), normally the sigmoid activation is used, and not the softmax; have you checked the (very analytical) Keras tutorial on the topic?Gramercy
Thanks for the reply. But could we still derive how the loss is computed?Aleida
So, I guess that by "error" in the title you actually mean loss, correct?Gramercy
Yes, that's right.Aleida
I edited the title - pls confirm that this is in fact what you ask (I added the autoencoder tag, too)...Gramercy
Correct, but aligned with the softmax output.Aleida

In the context of autoencoders, the input and output of the model are the same. So, if the input values are in the range [0,1], it is acceptable to use sigmoid as the activation function of the last layer. Otherwise, you need to use an appropriate activation function for the last layer (e.g. linear, which is the default).

As for the loss function, it again comes back to the values of the input data. If the input data consist only of zeros and ones (and not values in between), then binary_crossentropy is acceptable as the loss function. Otherwise, you need to use other loss functions such as 'mse' (i.e. mean squared error) or 'mae' (i.e. mean absolute error). Note that in the case of input values in the range [0,1] you can use binary_crossentropy, as it usually is (e.g. in the Keras autoencoder tutorial and this paper). However, don't expect the loss value to become zero, since binary_crossentropy does not return zero when both the prediction and the label are neither zero nor one (no matter whether they are equal or not). Here is a video from Hugo Larochelle where he explains the loss functions used in autoencoders (the part about using binary_crossentropy with inputs in the range [0,1] starts at 5:30).
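To make that last point concrete, here is a small NumPy sketch (the target value 0.3 is just an arbitrary example) showing that the loss is minimal, but not zero, when the prediction exactly equals a fractional target:

import numpy as np

def bce(y, p, eps=1e-7):
    # element-wise binary cross-entropy
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

y = 0.3                 # a "soft" target, e.g. a gray pixel
print(bce(y, 0.3))      # ~0.611 -- the minimum, but not zero
print(bce(y, 0.5))      # ~0.693 -- worse than the perfect prediction
print(bce(y, 1.0))      # ~11.3  -- much worse, far from the target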

Concretely, in your example, you are using the MNIST dataset. So by default the values of MNIST are integers in the range [0, 255]. Usually you need to normalize them first:

trainX = trainX.astype('float32')
trainX /= 255.

Now the values are in the range [0,1], so sigmoid can be used as the activation function of the last layer, with either binary_crossentropy or mse as the loss function.
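Putting it together, a minimal sketch of the corrected setup might look like the following (the layer sizes are taken from your code; the imports assume standalone Keras, and only the output activation and the preprocessing change):

from keras.datasets import mnist
from keras.layers import Input, Dense
from keras.models import Model

iLayer = Input((784,))
layer1 = Dense(128, activation='relu')(iLayer)
layer2 = Dense(64, activation='relu')(layer1)
layer3 = Dense(28, activation='relu')(layer2)
layer4 = Dense(64, activation='relu')(layer3)
layer5 = Dense(128, activation='relu')(layer4)
# sigmoid squashes each of the 784 outputs to [0,1] independently
layer6 = Dense(784, activation='sigmoid')(layer5)

model = Model(iLayer, layer6)
model.compile(loss='binary_crossentropy', optimizer='adam')

(trainX, _), (testX, _) = mnist.load_data()
trainX = trainX.reshape(trainX.shape[0], -1).astype('float32') / 255.
model.fit(trainX, trainX, epochs=5, batch_size=100)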


Why can binary_crossentropy be used even when the true label values (i.e. ground truth) are in the range [0,1]?

Note that we are trying to minimize the loss function during training. So if the loss function we have used reaches its minimum value (which may not necessarily be zero) when the prediction is equal to the true label, then it is an acceptable choice. Let's verify that this is the case for binary cross-entropy, which is defined as follows:

bce_loss = -y*log(p) - (1-y)*log(1-p)

where y is the true label and p is the predicted value. Let's consider y as fixed and see what value of p minimizes this function: we need to take the derivative with respect to p (I have assumed the log is the natural logarithm function for simplicity of calculations):

bce_loss_derivative = -y*(1/p) - (1-y)*(-1/(1-p)) = 0 =>
                      -y/p + (1-y)/(1-p) = 0 =>        (multiply both sides by p*(1-p))
                      -y*(1-p) + (1-y)*p = 0 =>
                      -y + y*p + p - y*p = 0 =>
                       p - y = 0 => y = p

As you can see, binary cross-entropy has its minimum value when y = p, i.e. when the true label is equal to the predicted value, and this is exactly what we are looking for.
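The same fact can be checked numerically; here is a quick sketch (the fixed label 0.3 and the grid of candidate predictions are arbitrary choices):

import numpy as np

y = 0.3                                # a fixed fractional true label
p = np.linspace(0.001, 0.999, 999)     # candidate predictions
bce = -y * np.log(p) - (1 - y) * np.log(1 - p)

print(p[np.argmin(bce)])   # ~0.3 -- the loss is minimized at p = y
print(bce.min())           # ~0.611 -- and the minimum is not zero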

Jolynjolynn answered 21/9, 2018 at 11:58 Comment(9)
Not exactly accurate; pls check the Keras tutorial on autoencoders, where binary cross entropy + sigmoid are used for MNIST data (pixel values), which are certainly not binary...Gramercy
@Gramercy I guess that's a bit wrong, because binary cross-entropy does not return zero when the prediction and the label are the same but are neither zero nor one. In other words, you are predicting correctly, but the loss is not zero! Look at this answer on Cross Validated for more info.Jolynjolynn
Don't have an opinion myself (that's why I didn't provide an answer); I just guess that Chollet must know what he is doing, at least regarding such relatively elementary (for him) stuff. Not quite sure about the relevance of the linked thread, too...Gramercy
@Gramercy Of course he knows! I updated my answer. Please take a look.Jolynjolynn
@Gramercy Although you may not have time, I just wanted to let you know that I just added the mathematical proof that why binary_crossentropy can be an acceptable choice. I just thought maybe you are interested to know why. Cheers!Jolynjolynn
You earned a well deserved upvote; happy to trigger you to edit the answer & correct your initial inaccurate claim ("and not the values between them") ;)Gramercy
@Gramercy Thank you very much for that trigger. I learned something new because of that.Jolynjolynn
mae is better for inputs in [0,1]; getting to zero-valued pixels with CE or MSE is difficult.Spiritualist
@Spiritualist I am not sure MAE is better. For one, loss function doesn't have to be zero when the output is optimal, it just needs to be minimal. Moreover, BCE is good when sigmoid is used in the last layer, since the log in BCE will undo the exp in sigmoid, and thus prevent loss saturation (which can then prevent some gradient-based algos from making progress). Source: DLB, chapter 8 (I think)Taps