Noisy training loss

I am training an encoder-decoder attention-based model with batch size 8. I don't suspect too much noise in the dataset; however, the examples come from a few different distributions.

I can see a lot of noise in the training loss curve. After averaging (smoothing factor 0.99), the trend is fine. The accuracy of the model is also not bad.

I'd like to understand what could be the reason for this shape of the loss curve.

[Figures: noisy training loss; averaged training loss]

Lumpy answered 2/2, 2018 at 9:14 Comment(3)
Too high learning rate? – Cristicristian
The batch size is really small; try using 32 samples. The fewer samples in a batch, the more weight each single sample gets and the stronger the effect of outliers. – Bluetongue
This is an encoder-decoder attention-based model, so every example is in fact very complex, with a long sequence as input and an output of different kind and length. A bigger batch size doesn't fit even on top GPUs, but thank you. – Lumpy

I found the answer myself.

I think the other answers are not correct, because they are based on experience with simpler models/architectures. The main point that was bothering me is that noise in losses is usually more symmetrical (you can plot the average and the noise is distributed randomly above and below it). Here, instead, we see a low-lying trend with sudden peaks.

As I wrote, the architecture I'm using is an encoder-decoder with attention. It follows that inputs and outputs can have different lengths. The loss is summed over all time steps and is not divided by the number of time steps.

https://www.tensorflow.org/tutorials/seq2seq

Important note: It's worth pointing out that we divide the loss by batch_size, so our hyperparameters are "invariant" to batch_size. Some people divide the loss by (batch_size * num_time_steps), which plays down the errors made on short sentences. More subtly, our hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning rate of 1.0, the latter approach effectively uses a much smaller learning rate of 1 / num_time_steps.

I was not averaging the loss over the time steps; that's why the noise is observable.

P.S. Similarly, a batch size of, for example, 8 can correspond to a few hundred input and target tokens, so you can't really say whether it is small or big without knowing the mean example length.
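To see why this matters, here is a minimal NumPy sketch (a toy simulation, not the actual training code; the per-token loss of ~2 nats and the 10/120-token length mix are made-up assumptions) comparing the two normalizations from the quote above: summing the cross-entropy over time steps and dividing by batch_size versus dividing by the total number of tokens in the batch.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: every token has roughly the same cross-entropy
# (~2 nats plus noise); only the sequence lengths differ. Lengths are a
# made-up mix of short (10) and long (120) examples to mimic examples
# coming from a few different distributions.
def simulate(batch_size=8, n_batches=200):
    loss_div_batch, loss_div_tokens = [], []
    for _ in range(n_batches):
        lengths = rng.choice([10, 120], size=batch_size, p=[0.8, 0.2])
        token_losses = [rng.normal(2.0, 0.5, size=n) for n in lengths]
        summed = sum(t.sum() for t in token_losses)
        # Option A (TF NMT tutorial): sum over time steps, divide by batch_size.
        loss_div_batch.append(summed / batch_size)
        # Option B: divide by the total number of time steps in the batch.
        loss_div_tokens.append(summed / lengths.sum())
    return np.array(loss_div_batch), np.array(loss_div_tokens)

a, b = simulate()
print(f"divide by batch_size       : mean {a.mean():6.2f}, std {a.std():6.2f}")
print(f"divide by number of tokens : mean {b.mean():6.2f}, std {b.std():6.2f}")

Under this toy assumption the batch_size-normalized curve spikes whenever a batch happens to contain several long examples, while the per-token curve stays nearly flat, which matches the "low path with sudden peaks" shape described above.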

Lumpy answered 7/3, 2018 at 9:42 Comment(3)
Care to elaborate on what the solution was? It's not particularly clear from your answer. It seems that the loss was expected after all, since you were not averaging? Is this correct? – Theodoretheodoric
I didn't understand your question, please ask again. There is no "solution": the loss is not averaged over the time steps (the examples' lengths, which are variable), so it is expected to look like this. Longer examples have a bigger loss. If you don't want to see that kind of noise, you can divide each batch's loss by sum(length_of_each_example_in_batch). – Lumpy
Ok, thanks. That's what I understood from your answer. Thanks for clarifying. – Theodoretheodoric

You are using mini-batch gradient descent, which computes the gradient of the loss function with respect to only the examples in the mini-batch. However, the loss you are measuring is over all training examples. The overall loss should have a downward trend, but it will often move in the wrong direction because the mini-batch gradient is not an accurate enough estimate of the full gradient.

Furthermore, you are multiplying the gradient by the learning rate at each step to try to descend the loss function. This is a local approximation and can often overshoot the target minimum, ending up at a higher point on the loss surface, especially if your learning rate is high.

[Image: loss curve for a single-parameter model, showing a gradient step overshooting the minimum]

Think of this image as the loss function for a model with only one parameter. We take the gradient at a point and multiply it by the learning rate to project a line segment in the direction of the negative gradient (not pictured). We then take the x-value at the end of this line segment as our updated parameter, and finally we compute the loss at this new parameter setting.

If our learning rate is too high, we will have overshot the minimum that the gradient was pointing towards and possibly ended up at a higher loss, as pictured.
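To illustrate, here is a small self-contained sketch (a toy linear-regression problem of my own, not the asker's model) of mini-batch SGD with batch size 8. The full-dataset loss trends downward, yet individual updates often push it back up, and a larger learning rate makes those upward moves bigger and more frequent.

import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: y = 3*x + 1 + noise.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=1000)

def full_loss(w, b):
    """Mean squared error over the whole training set."""
    return np.mean((X[:, 0] * w + b - y) ** 2)

def train(lr, batch_size=8, steps=300):
    w, b = 0.0, 0.0
    ups = 0  # how often the full loss moved in the "wrong" direction
    prev = full_loss(w, b)
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch_size)
        xb, yb = X[idx, 0], y[idx]
        err = xb * w + b - yb
        # Gradient of the mini-batch MSE, not of the full loss.
        w -= lr * 2.0 * np.mean(err * xb)
        b -= lr * 2.0 * np.mean(err)
        cur = full_loss(w, b)
        ups += cur > prev
        prev = cur
    return full_loss(w, b), ups

for lr in (0.01, 0.05, 0.2):
    final, ups = train(lr)
    print(f"lr={lr}: final full loss {final:.3f}, loss increased on {ups}/300 steps")

The printed counts show how often a single mini-batch update moved the full loss uphill: the larger the learning rate, the more often and the further the loss jumps back up, even though the long-run trend is still downward.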

Dodecahedron answered 2/2, 2018 at 10:14 Comment(2)
Please notice that the loss is not like random noise; it's more like some batches trend down at a nice low level, while some produce very high sudden peaks. Taking that into consideration, do you still find your explanation plausible? – Lumpy
I believe so. Some mini-batches will behave well and some won't. Graphs like that are very common. Yours is noisier than most, though, probably due to your small mini-batch size, and possibly a learning rate that is a little high. – Dodecahedron

Noisy training loss with good accuracy can be due to the following reason:

Local minima:

The loss function can have local minima, so every time gradient descent heads towards a local minimum, the loss/cost decreases. But with a good learning rate, the model can jump away from these points, and gradient descent will converge towards the global minimum, which is the solution. That's why the training loss is noisy.

[Image: loss curve with local and global minima]

Sphingosine answered 2/2, 2018 at 9:54 Comment(0)
