Comparing MSE loss and cross-entropy loss in terms of convergence
For a very simple classification problem where I have a target vector [0,0,0,....0] and a prediction vector [0,0.1,0.2,....1], would cross-entropy loss converge better/faster, or would MSE loss? When I plot them, it seems to me that MSE loss has a lower error margin. Why would that be?

[plot: MSE and cross-entropy loss against the prediction, for a target of 0]

Or, for example, when I have the target as [1,1,1,1....1] I get the following:

[plot: MSE and cross-entropy loss against the prediction, for a target of 1]
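A minimal sketch of how such values can be computed, treating each prediction against the target element-wise (the small eps is only there to avoid log(0)):

```python
import numpy as np

eps = 1e-12                          # guard against log(0)
preds = np.linspace(0.0, 1.0, 11)    # [0, 0.1, 0.2, ..., 1.0]

for target in (0.0, 1.0):
    mse = (preds - target) ** 2
    bce = -(target * np.log(preds + eps) + (1 - target) * np.log(1 - preds + eps))
    print(f"target = {target}")
    print("  MSE:", np.round(mse, 3))
    print("  cross-entropy:", np.round(bce, 3))
```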

Spiro answered 16/3, 2018 at 13:41 Comment(2)
My answer on MSE vs cross-entropy will be helpful.Sonometer
@vipinbansal unfortunately it won'tYelena
You sound a little confused...

  • Comparing the values of MSE & cross-entropy loss and saying that one is lower than the other is like comparing apples to oranges
  • MSE is for regression problems, while cross-entropy loss is for classification ones; these contexts are mutually exclusive, hence comparing the numerical values of their corresponding loss measures makes no sense
  • When your prediction vector is like [0,0.1,0.2,....1] (i.e. with non-integer components), as you say, the problem is a regression (and not a classification) one; in classification settings, we usually use one-hot encoded target vectors, where only one component is 1 and the rest are 0
  • A target vector of [1,1,1,1....1] could be the case either in a regression setting, or in a multi-label multi-class classification, i.e. where the output may belong to more than one class simultaneously

On top of these, your plot choice, with the percentage (?) of predictions on the horizontal axis, is puzzling; I have never seen such plots in ML diagnostics, and I am not quite sure what exactly they represent or why they would be useful...

If you would like a detailed discussion of the cross-entropy loss & accuracy in classification settings, you may have a look at this answer of mine.

Battat answered 16/3, 2018 at 15:55 Comment(5)
"MSE is for regression problems, while cross-entropy loss is for classification ones..." I hear this a lot, but I am yet to find a good explanation as to why MSE can't/shouldn't be used for for classification problems. The only thing I can think of is that the log function is steeper than the squared function and so it will penalize bad-predictions better and will theoretically lead to faster convergence.Carrington
@Super-intelligentShade see last part of own answer hereBattat
Thanks desertnaut, but that question is the other way around. I still can't see a reason why MSE can't be applied to a classification problem.Carrington
@Super-intelligentShade I didn't say anything about the question; I said - check the last part of my answer, where I partially address this issue (which was irrelevant to the original question there).Battat
I finally found a satisfactory answer by the great Andrew Ng himself: using MSE in combination with sigmoid activation will result in a non-convex cost function with many local optima. As simple as that.Carrington
As a complement to the accepted answer, I will answer the following questions:

  1. What is the interpretation of MSE loss and cross-entropy loss from a probability perspective?
  2. Why is cross entropy used for classification and MSE for linear regression?

TL;DR Use MSE loss if the (random) target variable is drawn from a Gaussian distribution, and categorical cross-entropy loss if the (random) target variable is drawn from a Multinomial distribution.

MSE (Mean squared error)

One of the assumptions of linear regression is multivariate normality, from which it follows that the target variable is normally distributed (more on the assumptions of linear regression can be found here and here).

The Gaussian (normal) distribution with mean $\mu$ and variance $\sigma^2$ is given by

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Often in machine learning we deal with a distribution with mean 0 and variance 1 (or we transform our data to have mean 0 and variance 1). In this case the density becomes

$$\mathcal{N}(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

This is called the standard normal distribution.

For a model with weight parameters $\mathbf{w}$ and precision (inverse variance) parameter $\beta$, assuming a normal distribution for the target, the probability of observing a single target $t$ given input $x$ is expressed by the following equation

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}),\, \beta^{-1}\right)$$

where $y(x, \mathbf{w})$ is the mean of the distribution and is calculated by the model as (for a linear model)

$$y(x, \mathbf{w}) = \mathbf{w}^{\mathsf{T}} x$$

Now the probability of the target vector $\mathbf{t} = (t_1, \dots, t_N)$ given the inputs $\mathbf{x} = (x_1, \dots, x_N)$ can be expressed by

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)$$

Taking the natural logarithm of the left- and right-hand sides yields

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$

where $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)$ is the log-likelihood of the normal model. Training a model usually involves optimizing this likelihood with respect to $\mathbf{w}$. The maximum-likelihood solution for the parameter $\mathbf{w}$ is then given by (constant terms with respect to $\mathbf{w}$ can be omitted)

$$\mathbf{w}_{ML} = \arg\max_{\mathbf{w}} \left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2\right)$$

For training the model, omitting the constant $\beta/2$ doesn't affect the convergence, so maximizing the likelihood amounts to minimizing

$$\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2$$

This is called the (sum of) squared error, and taking the mean yields the mean squared error

$$\frac{1}{N}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2$$
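As a quick numerical check of this (a minimal sketch, with a fixed precision $\beta$ and arbitrary illustrative predictions), the negative log-likelihood above equals $\tfrac{\beta}{2}$ times the sum of squared errors plus a constant that does not depend on the predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=100)                    # targets
y = t + rng.normal(scale=0.1, size=100)     # illustrative model predictions
beta = 1.0                                  # precision (inverse variance), held fixed

sse = np.sum((y - t) ** 2)
nll = -np.sum(-0.5 * beta * (y - t) ** 2
              + 0.5 * np.log(beta) - 0.5 * np.log(2 * np.pi))
const = 0.5 * len(t) * (np.log(2 * np.pi) - np.log(beta))
print(np.isclose(nll, 0.5 * beta * sse + const))  # True: NLL = beta/2 * SSE + constant
```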

Cross entropy

Before going into the more general cross-entropy function, I will explain a specific type of cross entropy: binary cross entropy.

Binary cross entropy

The assumption behind binary cross entropy is that the target variable is drawn from a Bernoulli distribution. According to Wikipedia,

Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p

The probability mass function of a Bernoulli random variable is given by

$$p(t = 1) = p, \qquad p(t = 0) = 1 - p$$

where $t \in \{0, 1\}$ and $p$ is the probability of success. This can be written compactly as

$$p(t) = p^{t}\,(1 - p)^{1 - t}$$

Taking the negative natural logarithm of both sides yields

$$-\ln p(t) = -\,t \ln p - (1 - t)\ln(1 - p)$$

This is called binary cross entropy.
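A minimal sketch of this as code (the binary_cross_entropy helper below is just illustrative, with a clip to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(t, p, eps=1e-12):
    """Negative log-likelihood of targets t in {0, 1} under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

t = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.99])
print(binary_cross_entropy(t, p))         # per-example losses
print(binary_cross_entropy(t, p).mean())  # the usual reported loss
```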

Categorical cross entropy

The generalization of cross entropy covers the case where the random variable is multivariate (i.e. drawn from a Multinomial distribution) with the following probability distribution

$$p(\mathbf{t} \mid \mathbf{p}) = \prod_{k=1}^{K} p_k^{\,t_k}$$

Taking the negative natural logarithm of both sides yields the categorical cross-entropy loss

$$-\ln p(\mathbf{t} \mid \mathbf{p}) = -\sum_{k=1}^{K} t_k \ln p_k$$
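And the corresponding sketch for the categorical case (again an illustrative helper, assuming a one-hot target and predicted class probabilities that sum to 1):

```python
import numpy as np

def categorical_cross_entropy(t, p, eps=1e-12):
    """t: one-hot target vector; p: predicted class probabilities (e.g. a softmax output)."""
    return -np.sum(t * np.log(np.clip(p, eps, 1.0)))

t = np.array([0, 0, 1, 0])              # true class is index 2
p = np.array([0.1, 0.2, 0.6, 0.1])      # model's predicted distribution
print(categorical_cross_entropy(t, p))  # equals -log(0.6)
```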

Redford answered 26/12, 2018 at 14:8 Comment(0)
I tend to disagree with the previously given answers. The point is that the cross-entropy and MSE loss are the same.

Modern NNs learn their parameters using maximum likelihood estimation (MLE). The maximum likelihood estimator is the argmax, over the parameter space, of the product of the probabilities the model assigns to the training examples. If we apply a log transformation and rescale by the number of training examples, we obtain an expectation with respect to the empirical distribution defined by the training data.

Furthermore, depending on the distribution we assume for the model's output (e.g. Gaussian or Bernoulli), this yields either the MSE loss or the negative log-likelihood of the sigmoid output, i.e. binary cross entropy.
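A minimal numerical sketch of that equivalence (using numpy only; the example values are arbitrary): under a unit-variance Gaussian assumption the per-example negative log-likelihood is the squared error up to an additive constant, and under a Bernoulli assumption it is exactly the binary cross entropy.

```python
import numpy as np

# Gaussian assumption (unit variance): NLL = 0.5 * (y - yhat)^2 + constant
y_true = np.array([0.5, -1.2, 2.0])
y_pred = np.array([0.4, -1.0, 2.3])
gauss_nll = 0.5 * (y_pred - y_true) ** 2 + 0.5 * np.log(2 * np.pi)
print(np.allclose(gauss_nll - 0.5 * (y_pred - y_true) ** 2,
                  0.5 * np.log(2 * np.pi)))          # True: differs only by a constant

# Bernoulli assumption: NLL is exactly binary cross entropy
t = np.array([1.0, 0.0])
p = np.array([0.8, 0.3])                             # sigmoid outputs
print(-(t * np.log(p) + (1 - t) * np.log(1 - p)))    # per-example binary cross entropy
```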

For further reading: Ian Goodfellow "Deep Learning"

Niece answered 17/4, 2022 at 18:24 Comment(0)
A simple answer to your first question:

For a very simple classification problem ... would cross-entropy loss converge better/faster or would MSE loss?

is that MSE loss, when combined with a sigmoid activation, results in a non-convex cost function with multiple local minima. This is explained by Prof. Andrew Ng in his lecture:

Lecture 6.4 — Logistic Regression | Cost Function — [ Machine Learning | Andrew Ng]

I imagine the same applies to multiclass classification with softmax activation.
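A quick way to see the shape difference numerically (a minimal sketch, not from the lecture): scan a single weight of a 1-D logistic model and count how often the curvature of each cost changes sign.

```python
import numpy as np

x, t = 2.0, 1.0                        # one training example, positive class
w = np.linspace(-6, 6, 1001)           # scan the single weight
p = 1.0 / (1.0 + np.exp(-w * x))       # sigmoid output

mse = (p - t) ** 2
ce = -(t * np.log(p) + (1 - t) * np.log(1 - p))

def curvature_sign_changes(loss, tol=1e-12):
    """Count sign changes of the discrete second difference (0 for a convex curve)."""
    curv = np.diff(loss, 2)
    signs = np.sign(curv[np.abs(curv) > tol])
    return int(np.sum(signs[1:] != signs[:-1]))

print("MSE curvature sign changes:", curvature_sign_changes(mse))  # > 0: not convex
print("CE curvature sign changes:", curvature_sign_changes(ce))    # 0: convex on this scan
```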

Carrington answered 22/8, 2022 at 15:55 Comment(0)
