How to interpret loss and accuracy for a machine learning model [closed]

When I train my neural network with Theano or TensorFlow, they report a variable called "loss" per epoch.

How should I interpret this variable? Is a higher loss better or worse, and what does it mean for the final performance (accuracy) of my neural network?

Obvious answered 29/12, 2015 at 20:33 Comment(1)
I’m voting to close this question because Machine learning (ML) theory questions are off-topic on Stack Overflow - gift-wrap candidate for Cross-ValidatedKrantz

The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.

In the case of neural networks, the loss is usually the negative log-likelihood for classification and the residual sum of squares for regression. Then, naturally, the main objective in a learning model is to reduce (minimize) the loss function's value with respect to the model's parameters by changing the weight-vector values through different optimization methods, such as backpropagation in neural networks.
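For concreteness, here is a minimal NumPy sketch of those two losses; the function names and the toy arrays are mine, purely for illustration:

```python
import numpy as np

def negative_log_likelihood(y_true, y_prob, eps=1e-12):
    """Classification loss: sum over examples of -log(probability of the true class)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def residual_sum_of_squares(y_true, y_pred):
    """Regression loss: sum of squared errors."""
    return np.sum((y_true - y_pred) ** 2)

# toy binary labels and predicted probabilities of class 1
print(negative_log_likelihood(np.array([1, 0, 1]), np.array([0.8, 0.3, 0.6])))

# toy regression targets and predictions
print(residual_sum_of_squares(np.array([2.0, 1.5, 3.0]), np.array([1.8, 1.9, 2.7])))

# in both cases, lower is better
```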

The loss value indicates how well or poorly a certain model behaves after each iteration of optimization. Ideally, one would expect a reduction of the loss after each, or every few, iterations.

The accuracy of a model is usually determined after the model parameters are learned and fixed, and no learning is taking place. Then the test samples are fed to the model, and the number of mistakes (zero-one loss) the model makes is recorded after comparison to the true targets. Then the percentage of misclassification is calculated; the accuracy is the complementary percentage of correct classifications.

For example, if the number of test samples is 1000 and the model classifies 952 of those correctly, then the model's accuracy is 95.2%.
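As a sketch of that calculation (the label arrays below are invented for illustration):

```python
import numpy as np

# hypothetical true labels and model predictions on a test set
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0])

mistakes = np.sum(y_pred != y_true)      # zero-one loss, summed over the test set
accuracy = 1 - mistakes / len(y_true)    # fraction of correct classifications
print(f"accuracy = {accuracy:.1%}")      # 83.3% for these toy arrays
```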


There are also some subtleties when reducing the loss value. For instance, you may run into the problem of over-fitting, in which the model "memorizes" the training examples and becomes ineffective on the test set. Over-fitting is also more likely when you do not employ regularization, when the model is very complex (the number of free parameters W is large), or when the number of data points N is very low.
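As a rough illustration of the regularization point, here is a sketch under the assumption of a simple L2 penalty; data_loss, W, and lam are placeholder names, not part of any particular framework:

```python
import numpy as np

def regularized_loss(data_loss, W, lam=1e-3):
    """Total loss = data loss + L2 penalty on the free parameters W.
    The penalty discourages large weights, which helps against over-fitting."""
    return data_loss + lam * np.sum(W ** 2)

W = np.random.randn(100)                      # stand-in for a model's weight vector
print(regularized_loss(data_loss=0.42, W=W))  # slightly larger than the data loss alone
```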

Drewdrewett answered 29/12, 2015 at 21:21 Comment(12)
Hi @Amir, thanks for your very detailed explanation. However, I have a problem: in my neural network, the loss always decreases during training (as the epochs increase), but the accuracy does not improve.Obvious
@Obvious As long as the cost is decreasing you should be good to go. Although cost and accuracy normally have an inverse relationship, note that accuracy is a summation of zero-one errors whereas cost is a summation of floating-point numbers. Therefore, a 0.001% decrease in the cost does not necessarily mean a 0.001% increase in the accuracy. Increasing the accuracy is much harder when the decrease in cost is intangible (the cost is very close to a local minimum).Drewdrewett
@Obvious I should have said: as long as the cost for both training and validation is decreasing, you should be good to go. You may also check the validation accuracy on each epoch. If it starts going down (or the validation loss starts going up), then your model might have started to over-fit and you should stop training it.Drewdrewett
@Drewdrewett Thanks for the nice explanation. I just have some confusion in understanding the role of the validation set. I am following this tutorial. What I don't understand is why we need to calculate the validation loss, and what the validation loss contributes to the training.Specialism
In my case, I am training 3 different models with different parameters such as batch_size, regularization_strength, and learning_rate, but the number of epochs is the same for all. For a given number of epochs, if I look at the validation loss, model 2 has the highest loss value, but its accuracy is higher. How can that be possible? Or do I have a bug?Compressor
@Compressor I don't think that is a bug. Normally the way you calculate the classification accuracy is by taking the argmax over the predicted vector. Say your vector has 3 elements and you expect the model to give a higher score to the second element (because the target class is class No. 2). If the model prediction (before applying the softmax) is [0.45, 0.55, 0.35], the model has predicted the class correctly, but the error is still high. If the model prediction were something like [-0.2, 1.13, 0.3], the error would be lower while the class prediction is still correct (see the sketch after these comments).Drewdrewett
Could you explain it in case of hinge loss, instead of softmax for same example ? I am using hinge loss.Compressor
A couple of links I found that help elucidate: 1. Per datascience.stackexchange.com/questions/13663/…, you should optimize for loss. 2. Per #40911357, you want to keep an eye on loss vs. accuracy trends, and implement early stopping for your model.Essen
It is usually a mean, not a summation.Anxious
Why not train the model to increase accuracy rather than to minimise loss?Propensity
@bikashg accuracy is not differentiable and therefore you can't backprop on it.Exact
I should link stats.stackexchange.com/questions/282160 here.Adolphus
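Here is a small Python sketch of the argmax vs. loss point from the comment above; the score vectors are the ones quoted in the comment, and the softmax/cross-entropy choice is an assumption for illustration (the commenter mentions they actually use hinge loss):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

target = 1                     # the true class is class No. 2 (index 1)

for scores in ([0.45, 0.55, 0.35], [-0.2, 1.13, 0.3]):
    probs = softmax(np.array(scores))
    predicted = int(np.argmax(probs))     # argmax decides the predicted class
    loss = -np.log(probs[target])         # cross-entropy for this single sample
    print(scores, "-> predicted", predicted,
          "correct:", predicted == target, "loss: %.3f" % loss)

# Both score vectors predict the correct class (same accuracy),
# but the first one produces a noticeably higher loss.
```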

Loss and accuracy are two different metrics for evaluating your model's performance, and they are usually used in different phases.

Loss is often used in the training process to find the "best" parameter values for your model (e.g. the weights in a neural network). It is what you try to optimize during training by updating the weights.

Accuracy is more of an applied metric. Once you have found the optimized parameters above, you use this metric to evaluate how accurate your model's predictions are compared to the true data.
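For instance, in TensorFlow/Keras this division of roles is explicit: the loss passed to compile() is what the optimizer minimizes, while accuracy is only monitored and reported as a metric. This is a hedged sketch, not part of the original answer; the layer size and the random data are placeholders:

```python
import numpy as np
import tensorflow as tf

# placeholder data: 2 features per sample, binary label
X = np.random.rand(100, 2).astype("float32")
y = np.random.randint(0, 2, size=(100,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# the loss is what the optimizer minimizes; accuracy is only reported per epoch
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, validation_split=0.2, verbose=2)
```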

Let us use a toy classification example. You want to predict gender from a person's weight and height. You have three data points, listed below (0 stands for male, 1 stands for female):

y1 = 0, x1_w = 50kg, x1_h = 160cm;

y2 = 0, x2_w = 60kg, x2_h = 170cm;

y3 = 1, x3_w = 55kg, x3_h = 175cm;

You use a simple logistic regression model:

y_hat = 1 / (1 + exp(-(b1*x_w + b2*x_h)))

How do you find b1 and b2? You first define a loss, then use an optimization method to minimize the loss iteratively by updating b1 and b2.

In our example, a typical loss for this binary classification problem is the negative log-likelihood (note the minus sign in front of the summation):

loss = -sum_i [ y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i) ]

We don't know what b1 and b2 should be, so let us make a random guess, say b1 = 0.1 and b2 = -0.03. What is our loss now?

Plugging the three data points into the model gives y_hat_1 = 1/(1 + exp(-(0.1*50 - 0.03*160))) ≈ 0.55, y_hat_2 ≈ 0.71 and y_hat_3 ≈ 0.56, so the loss is

loss = -[log(1 - 0.55) + log(1 - 0.71) + log(0.56)] ≈ 2.6

Then your learning algorithm (e.g. gradient descent) will find a way to update b1 and b2 to decrease the loss.

Suppose b1 = 0.1 and b2 = -0.03 happen to be the final b1 and b2 (the output of gradient descent); what is the accuracy now?

Let's assume that if y_hat >= 0.5 we decide our prediction is female (1), and otherwise it is male (0). Then our algorithm predicts y1 = 1, y2 = 1 and y3 = 1. What is our accuracy? We make wrong predictions on y1 and y2 and a correct one on y3, so our accuracy is 1/3 = 33.33%.
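Putting the whole toy example together, here is a small NumPy sketch of the calculations above; it just re-does the arithmetic for the guessed b1 = 0.1 and b2 = -0.03, nothing here is fitted:

```python
import numpy as np

# the three data points: (weight in kg, height in cm) -> gender (0 = male, 1 = female)
X = np.array([[50.0, 160.0],
              [60.0, 170.0],
              [55.0, 175.0]])
y = np.array([0, 0, 1])

b = np.array([0.1, -0.03])                    # the guessed b1, b2

y_hat = 1.0 / (1.0 + np.exp(-X.dot(b)))       # logistic regression predictions
loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print("loss:", round(loss, 2))                # about 2.6 for this guess

pred = (y_hat >= 0.5).astype(int)             # threshold at 0.5
accuracy = np.mean(pred == y)
print("accuracy:", accuracy)                  # 1/3, i.e. about 0.33
```

Gradient descent would then repeatedly adjust b1 and b2 in the direction that reduces this loss.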

PS: In Amir's answer, back-propagation is said to be an optimization method in NNs. I think it is better treated as a way to compute the gradients for the weights in a NN. Common optimization methods in NNs are gradient descent and Adam.

Sthilaire answered 17/10, 2017 at 22:46 Comment(2)
Thank you for the math. It helped clarify the concept.Paintbox
Your math equations should be converted to MathJax; they are rendering very strangely.Deerstalker

Just to clarify the training/validation/test data sets: the training set is used to perform the initial training of the model, fitting the weights of the neural network.

The validation set is used after the neural network has been trained. It is used for tuning the network's hyperparameters and comparing how changes to them affect the predictive accuracy of the model. Whereas the training set can be thought of as being used to fit the neural network's weights, the validation set allows fine-tuning of the parameters or architecture of the model. It is useful because it allows repeatable comparison of different parameters/architectures against the same data and network weights, to observe how parameter/architecture changes affect the predictive power of the network.

Then the test set is used only to test the predictive accuracy of the trained neural network on previously unseen data, after training and parameter/architecture selection with the training and validation data sets.
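A minimal sketch of producing the three splits described above; the 60/20/20 proportions, the scikit-learn helper, and the random data are illustrative choices, not something prescribed by this answer:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)            # placeholder features
y = np.random.randint(0, 2, 1000)       # placeholder labels

# first carve off a held-out test set, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

# train on (X_train, y_train), tune hyperparameters/architecture on (X_val, y_val),
# and report the final accuracy on (X_test, y_test) exactly once
```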

Quiddity answered 1/11, 2017 at 14:46 Comment(0)
