In which cases is the cross-entropy preferred over the mean squared error? [closed]

Although both of these methods score how close a prediction is to the target, cross-entropy is still generally preferred. Is that the case in every situation, or are there particular scenarios where we prefer cross-entropy over MSE?

Inquietude answered 9/4, 2016 at 9:50 Comment(1)

Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression. This follows directly from the statement of the problems themselves: in classification you work with a very particular set of possible output values, so MSE is badly suited (it lacks this knowledge and therefore penalizes errors in an incompatible way). To better understand the phenomenon, it is good to follow and understand the relations between

  1. cross entropy
  2. logistic regression (binary cross entropy)
  3. linear regression (MSE)

You will notice that logistic regression and linear regression can both be seen as maximum likelihood estimators, simply with different assumptions about the dependent variable.
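This maximum-likelihood connection can be made concrete with a small numerical sketch (the data values below are illustrative, not from the answer): under a fixed-variance Gaussian assumption the negative log-likelihood reduces to MSE up to constants, and under a Bernoulli assumption it is exactly binary cross-entropy.

```python
import numpy as np

# Illustrative toy data: binary targets y and model-predicted probabilities yhat.
y = np.array([1.0, 0.0, 1.0, 1.0])
yhat = np.array([0.9, 0.2, 0.7, 0.6])

# Gaussian assumption (fixed variance): the negative log-likelihood is the
# squared error, up to additive and multiplicative constants -> MSE.
mse = np.mean((y - yhat) ** 2)

# Bernoulli assumption: the negative log-likelihood is exactly the
# binary cross-entropy.
bce = -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

print(mse, bce)
```

Both losses are averages of per-example negative log-likelihoods; only the assumed distribution of the dependent variable differs.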

Phono answered 9/4, 2016 at 11:52 Comment(3)
Could you please elaborate more on "assumptions about the dependent variable" ?Emulsion
@Fake - as Duc pointed out in the separate answer, logistic regression assumes a binomial distribution (or multinomial in the generalised case of cross-entropy and softmax) of the dependent variable, while linear regression assumes that it is a linear function of the variables plus IID noise sampled from a zero-mean Gaussian with fixed variance.Phono
I once trained a single output neuron using MSE-loss to output 0 or 1 [for negative and positive classes]. The result was that all the outputs were at the extremes - you couldn't pick a threshold. Using two neurons with CE loss got me a much smoother result, so I could pick a threshold. Probably BCE is what you want to use if you stay with a single neuron.Potpie

When you derive the cost function from the perspective of probability and distributions, you can observe that MSE arises when you assume the error follows a normal distribution, and cross-entropy when you assume a binomial distribution. It means that implicitly, when you use MSE you are doing regression (estimation), and when you use CE you are doing classification. Hope it helps a little bit.
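A small numerical sketch of the Gaussian half of this claim (the values are illustrative, not from the answer): the negative log of a Gaussian density differs from the squared error only by constants, so the two are minimized at the same point.

```python
import numpy as np

sigma = 1.0
y = 0.8
yhat = np.linspace(0.1, 0.9, 9)  # candidate predictions

# Negative log of the Gaussian density N(y | yhat, sigma^2).
gauss_nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y - yhat) ** 2 / (2 * sigma**2)
sq_err = (y - yhat) ** 2

# The NLL is the squared error scaled and shifted by constants,
# so both are minimized by the same prediction (yhat = y = 0.8).
assert np.allclose(gauss_nll - sq_err / (2 * sigma**2), 0.5 * np.log(2 * np.pi * sigma**2))
assert np.argmin(gauss_nll) == np.argmin(sq_err)
```

The Bernoulli case works the same way: the negative log of p^y (1-p)^(1-y) is, term by term, the binary cross-entropy.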

Stafani answered 11/4, 2016 at 9:3 Comment(3)
Say we have 2 probability distribution vectors: actual [0.3, 0.5, 0.1, 0.1] and predicted [0.4, 0.2, 0.3, 0.1]. Now if we use MSE to determine our loss, why would this be a worse choice than KL divergence? What features are missed when we apply MSE to such data?Lightner
Could you show how gaussian leads to MSE and binomial leads to cross entropy?Glover
@KunyuShi Look at the PDF/PMF of the normal and Bernoulli distributions. If we take their negative log (which we generally do, to simplify the loss function) we get MSE and binary cross-entropy, respectively.Receipt

If you do logistic regression, for example, you will use the sigmoid function to estimate the probability, cross-entropy as the loss function, and gradient descent to minimize it. Using MSE as the loss function instead can lead to a non-convex problem where you might get stuck in a local minimum. Using cross-entropy leads to a convex problem where you can find the optimal solution.
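The convexity claim can be checked numerically with a minimal one-parameter sketch (an illustrative setup, not from the answer): one input x = 1, target y = 1, prediction p = sigmoid(w). The curvature of the squared-error loss changes sign along w, while the cross-entropy curvature stays positive.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.linspace(-6, 6, 241)
h = w[1] - w[0]

# Single example with x = 1, target y = 1 (illustrative).
p = sigmoid(w)
mse = (1 - p) ** 2   # squared error applied after the sigmoid
ce = -np.log(p)      # cross-entropy (log loss)

def second_diff(f):
    # Central finite-difference estimate of the second derivative.
    return (f[2:] - 2 * f[1:-1] + f[:-2]) / h**2

# MSE curvature takes both signs -> non-convex in w.
assert second_diff(mse).min() < 0 < second_diff(mse).max()
# CE curvature is strictly positive -> convex in w.
assert second_diff(ce).min() > 0
```

Analytically, the cross-entropy curvature here is sigmoid(w) * (1 - sigmoid(w)) > 0 everywhere, whereas the MSE curvature flips sign, which is why gradient descent on MSE can stall in flat or locally minimal regions.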

https://www.youtube.com/watch?v=rtD0RvfBJqQ&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=35

There is also an interesting analysis here: https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/

Nagaland answered 24/4, 2017 at 16:29 Comment(2)
The youtube link no longer works.Cowl
Sharing a different video that also explains the convexity point - youtu.be/m0ZeT1EWjjI Also, check this additional one - youtu.be/gIx974WtVb4Casualty
