Keras mean squared error loss layer
I am currently implementing a custom loss layer, and in the process I stumbled upon the implementation of mean squared error in the objectives.py file [1]. I know I'm missing something in my understanding of this loss calculation, because I always thought the average was taken separately across the samples for each output in each mini-batch (axis 0 of the tensor), but it appears the average is actually taken across the last axis, which for a single vector would mean it's taken across the outputs. I found this by accident while working on my custom loss layer, which requires discounting the loss of a few of the outputs if a training output in a specific place has a specific value. Anyway, is my understanding of mean squared error incorrect? Why would Keras use the last axis, thereby turning a 1xn output vector into a 1x1 output vector?

Thanks.

[1] https://github.com/fchollet/keras/blob/master/keras/objectives.py#L7

Campanology answered 17/1, 2017 at 21:48 Comment(6)
What do you think K.mean means? :)Overbear
Sorry- I adjusted my question. I meant that I didn't see where the squaring was happening, not the mean.Campanology
That would be K.squareOverbear
Did you read my whole question?Campanology
Yes, but in any case there are multiple questions here, I was just pointing out one.Overbear
I'm not asking how to calculate the square; I'm asking why the default MSE function supplied in the framework is not doing the squaring when it is called "Mean Squared Error". I don't see any place in the calculation where the squaring is done. I know how to calculate the square; I want to know why the author of that code did not.Campanology
The code in question for the MSE loss is this:

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

Here y_true is first subtracted from y_pred; that result is passed to K.square, which, as expected, returns the square of its argument; and that result is given to K.mean, which computes the mean over the last axis.

So the code is clearly doing what it's supposed to do. As for why the last axis is operated on: this has nothing to do with classes, it is just a convention. Note that in general there are no classes in the MSE definition.
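To see concretely what reducing over the last axis does, here is a NumPy sketch of the same function (the array values are made up for illustration). The key point is that for a standard (batch, outputs) tensor, axis=-1 produces one loss value per sample, not per output:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """NumPy sketch of the Keras MSE: mean over the last axis only."""
    return np.mean(np.square(y_pred - y_true), axis=-1)

# A mini-batch of 4 samples with 3 outputs each.
y_true = np.zeros((4, 3))
y_pred = np.array([[1., 1., 1.],
                   [2., 2., 2.],
                   [0., 0., 0.],
                   [1., 0., 0.]])

loss = mean_squared_error(y_true, y_pred)
print(loss.shape)  # (4,) -- one loss value per sample, not per output
print(loss)        # [1. 4. 0. 0.33333333]
```

So the squared errors of the n outputs of each sample are collapsed into a single per-sample score; averaging over the batch axis happens elsewhere, as the answer below the code explains.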

Overbear answered 17/1, 2017 at 22:17 Comment(5)
Ah, you are right that I missed the K.square in the code. Woops. I am on a private network and unfortunately I cannot copy/paste code and have to hand-jam it. In this case, I hand-jammed it improperly. Thus you are right about the last question I asked at the end.Campanology
Thanks for your answer, btw! The axis, however, is really the cause for my question. It is actually a very much large deal to me that they use axis=-1 instead of axis=0 and the reason why is because the convention of how they define the tensors which pass through the network. They force you to use the batch size as the first dimension of the tensor and, for a single set of values in a vector as outputs, force that to be last dimension. This means that they are taking the loss across all of these outputs rather than each one individually.Campanology
I know what I did wrong in my copying. I accidentally copied the mean_absolute_error instead of the mean_squared. That part is fixed but the axis problem still bothers me.Campanology
What do you mean? @CoreyJ.Nolet is totally right. The mean should be taken across the batches. Why is it axis=-1?Brahmaputra
Let's detail the steps of how the losses are computed in Keras, to show that the axis=-1 in all the loss computations is correct:

  • We pick a loss in losses.py that we will pass to the compile method of our model.

  • In compile, the total loss is computed. This happens in several steps: the first step creates a list of losses, one for each output of the model.

  • This first step calls _weighted_masked_objective, which according to the docs 'Adds support for masking and sample-weighting to an objective function'.
  • Basically, _weighted_masked_objective returns a new objective function that takes into account the weights and mask parameters the user will provide when calling fit.

If we cut the code down to only the lines that matter for the question, we get something like this:

def _weighted_masked_objective(fn):
    def weighted(y_true, y_pred, weights, mask=None):
        score_array = fn(y_true, y_pred)  # Compute the loss as in losses.py
        return K.mean(score_array)        # Average over all remaining axes
    return weighted

class Model(Container):
    def compile(self, optimizer, loss, metrics=None, loss_weights=None,
                sample_weight_mode=None, weighted_metrics=None,
                target_tensors=None, **kwargs):
        weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]

So in the end, the loss is indeed averaged over every dimension, and the use of axis=-1 is just an elegant way to enable masking and weighting of the loss at another point in the code.
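The two-stage averaging described above can be sketched in NumPy (the shapes and the sample_weights name are made up for illustration). Without weighting, the per-sample mean over axis=-1 followed by a mean over what remains is mathematically identical to a single global mean, but the intermediate per-sample score_array is where weights and masks can be applied:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(8, 5))  # batch of 8 samples, 5 outputs each
y_pred = rng.normal(size=(8, 5))

# Stage 1: the loss function reduces over the last axis only -> shape (8,)
score_array = np.mean(np.square(y_pred - y_true), axis=-1)

# Per-sample weights or masks could be applied here, e.g.:
# score_array *= sample_weights

# Stage 2: the wrapper averages over everything that remains -> a scalar
total_loss = np.mean(score_array)

# With no weighting, the two stages together equal one global mean:
assert np.isclose(total_loss, np.mean(np.square(y_pred - y_true)))
```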

NB: I didn't explain the other steps because they don't contribute to answering the question.

Krona answered 11/2, 2018 at 19:39 Comment(0)
I believe, after some conversations with coworkers, that I understand this situation and have a proper solution to the problem. Though I knew that Theano was providing lazily evaluated tensor functions that run the matrix operations on the GPU, what I did not realize was that Keras's loss functions are actually written in a way where the compiled Theano execution graph is smart enough to cache certain values in order to properly back-propagate the loss values throughout the network. Because of the type of network I'm creating, I dove into writing my own custom loss function without completely understanding how Theano actually treats the loss after it has been calculated by the function.

From what I can tell, my concern was correct that Keras' use of the last axis is a problem. In my case, I have a fully-convolutional deep neural network, and the input to the loss function is (x, 7, 16, 16), where x is the size of the mini-batch. Normally, neural networks output a matrix where the first dimension is the mini-batch size and the second (usually last) dimension is the actual size of the output vector. Because of this, using the last axis of the output tensor to do the "mean" portion of the mean squared error is not correct. Instead, the axis should be 1 (with zero-based indexing), because it's the 7 actual regression output features that need to be differentiated for back-propagation.
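A NumPy sketch makes the shape mismatch concrete (the batch size of 32 is made up; the (7, 16, 16) output shape is the one from the answer above). Reducing over axis=-1 averages over the spatial width, while reducing over axis=1 collapses the 7 feature maps:

```python
import numpy as np

# Hypothetical fully-convolutional output: (mini-batch, features, height, width)
batch = np.zeros((32, 7, 16, 16))

# Keras' default reduces over the last axis (the spatial width here):
per_last = np.mean(np.square(batch), axis=-1)
print(per_last.shape)  # (32, 7, 16)

# Reducing over axis=1 instead collapses the 7 regression feature maps:
per_feature = np.mean(np.square(batch), axis=1)
print(per_feature.shape)  # (32, 16, 16)
```

So for a (batch, outputs) tensor the two conventions coincide in spirit, but for a 4D convolutional output they reduce over entirely different dimensions.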

I originally knew that the axis = -1 may not be correct and the reason I posted this question was because I couldn't quite explain why. It's been a long time since I've had to dive into the math behind the neural networks but when I finally did, I was able to resolve the gaps (I think). I'm posting this response here for future people who may experience this same problem or gap in their understanding of Theano's tensor framework.

Campanology answered 18/1, 2017 at 20:2 Comment(0)