In which cases is the cross-entropy preferred over the mean squared error? [closed]

Although both of these methods score how close a prediction is to the target, cross-entropy is still generally preferred. Is that the case in every situation, or are there particular scenarios where we prefer cross-entropy over MSE?

Inquietude answered 9/4, 2016 at 9:50 Comment(1)

Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression. This follows directly from the statement of the problems themselves: in classification you work with a very particular set of possible output values, so MSE is badly suited (it lacks this knowledge and therefore penalizes errors in an incompatible way). To better understand the phenomenon, it is good to follow and understand the relations between

  1. cross entropy
  2. logistic regression (binary cross entropy)
  3. linear regression (MSE)

You will notice that logistic regression and linear regression can both be seen as maximum likelihood estimators, simply with different assumptions about the dependent variable.
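This maximum-likelihood connection can be made concrete with a small numerical sketch (the data values below are illustrative, not from the answer): under a fixed-variance Gaussian assumption the negative log-likelihood reduces to MSE up to constants, and under a Bernoulli assumption it is exactly binary cross-entropy.

```python
import numpy as np

# Illustrative toy data: binary targets y and model-predicted probabilities yhat.
y = np.array([1.0, 0.0, 1.0, 1.0])
yhat = np.array([0.9, 0.2, 0.7, 0.6])

# Gaussian assumption (fixed variance): the negative log-likelihood is the
# squared error, up to additive and multiplicative constants -> MSE.
mse = np.mean((y - yhat) ** 2)

# Bernoulli assumption: the negative log-likelihood is exactly the
# binary cross-entropy.
bce = -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

print(mse, bce)
```

Both losses are averages of per-example negative log-likelihoods; only the assumed distribution of the dependent variable differs.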

Phono answered 9/4, 2016 at 11:52 Comment(3)
Could you please elaborate more on "assumptions about the dependent variable" ?Emulsion
@Fake - as Duc pointed out in the separate answer, logistic regression assumes a binomial distribution (or multinomial in the generalised case of cross-entropy and softmax) of the dependent variable, while linear regression assumes that it is a linear function of the variables plus IID noise sampled from a zero-mean Gaussian with fixed variance.Phono
I once trained a single output neuron using MSE-loss to output 0 or 1 [for negative and positive classes]. The result was that all the outputs were at the extremes - you couldn't pick a threshold. Using two neurons with CE loss got me a much smoother result, so I could pick a threshold. Probably BCE is what you want to use if you stay with a single neuron.Potpie

When you derive the cost function from the perspective of probability and distributions, you can observe that MSE arises when you assume the error follows a normal distribution, and cross-entropy when you assume a binomial distribution. It means that implicitly, when you use MSE you are doing regression (estimation), and when you use CE you are doing classification. Hope it helps a little bit.
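A small numerical sketch of the Gaussian half of this claim (the values are illustrative, not from the answer): the negative log of a Gaussian density differs from the squared error only by constants, so the two are minimized at the same point.

```python
import numpy as np

sigma = 1.0
y = 0.8
yhat = np.linspace(0.1, 0.9, 9)  # candidate predictions

# Negative log of the Gaussian density N(y | yhat, sigma^2).
gauss_nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y - yhat) ** 2 / (2 * sigma**2)
sq_err = (y - yhat) ** 2

# The NLL is the squared error scaled and shifted by constants,
# so both are minimized by the same prediction (yhat = y = 0.8).
assert np.allclose(gauss_nll - sq_err / (2 * sigma**2), 0.5 * np.log(2 * np.pi * sigma**2))
assert np.argmin(gauss_nll) == np.argmin(sq_err)
```

The Bernoulli case works the same way: the negative log of p^y (1-p)^(1-y) is, term by term, the binary cross-entropy.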

Stafani answered 11/4, 2016 at 9:3 Comment(3)
Say we have 2 probability distribution vectors: actual [0.3, 0.5, 0.1, 0.1] and predicted [0.4, 0.2, 0.3, 0.1]. Now if we use MSE to determine our loss, why would this be a worse choice than KL divergence? What features are missed when we apply MSE to such data?Lightner
Could you show how gaussian leads to MSE and binomial leads to cross entropy?Glover
@KunyuShi Look at the PDF/PMF of the normal and Bernoulli distributions. If we take their negative log (which we generally do, to simplify the loss function) we get MSE and binary cross-entropy, respectively.Receipt

If you do logistic regression, for example, you will use the sigmoid function to estimate the probability, cross-entropy as the loss function, and gradient descent to minimize it. Using MSE as the loss function instead can lead to a non-convex problem where you might get stuck in a local minimum. Using cross-entropy leads to a convex problem where you can find the optimal solution.
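The convexity claim can be checked numerically with a minimal one-parameter sketch (an illustrative setup, not from the answer): one input x = 1, target y = 1, prediction p = sigmoid(w). The curvature of the squared-error loss changes sign along w, while the cross-entropy curvature stays positive.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.linspace(-6, 6, 241)
h = w[1] - w[0]

# Single example with x = 1, target y = 1 (illustrative).
p = sigmoid(w)
mse = (1 - p) ** 2   # squared error applied after the sigmoid
ce = -np.log(p)      # cross-entropy (log loss)

def second_diff(f):
    # Central finite-difference estimate of the second derivative.
    return (f[2:] - 2 * f[1:-1] + f[:-2]) / h**2

# MSE curvature takes both signs -> non-convex in w.
assert second_diff(mse).min() < 0 < second_diff(mse).max()
# CE curvature is strictly positive -> convex in w.
assert second_diff(ce).min() > 0
```

Analytically, the cross-entropy curvature here is sigmoid(w) * (1 - sigmoid(w)) > 0 everywhere, whereas the MSE curvature flips sign, which is why gradient descent on MSE can stall in flat or locally minimal regions.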

https://www.youtube.com/watch?v=rtD0RvfBJqQ&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=35

There is also an interesting analysis here: https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/

Nagaland answered 24/4, 2017 at 16:29 Comment(2)
The youtube link no longer works.Cowl
Sharing a different video that also explains the convexity point - youtu.be/m0ZeT1EWjjI Also, check this additional one - youtu.be/gIx974WtVb4Casualty
