Loss calculation over different batch sizes in Keras

I know that in theory, the loss of a network over a batch is just the sum of all the individual losses. This is reflected in the Keras code for calculating total loss. Relevantly:

for i in range(len(self.outputs)):
    if i in skip_target_indices:
        continue
    y_true = self.targets[i]
    y_pred = self.outputs[i]
    weighted_loss = weighted_losses[i]
    sample_weight = sample_weights[i]
    mask = masks[i]
    loss_weight = loss_weights_list[i]
    with K.name_scope(self.output_names[i] + '_loss'):
        output_loss = weighted_loss(y_true, y_pred,
                                    sample_weight, mask)
    if len(self.outputs) > 1:
        self.metrics_tensors.append(output_loss)
        self.metrics_names.append(self.output_names[i] + '_loss')
    if total_loss is None:
        total_loss = loss_weight * output_loss
    else:
        total_loss += loss_weight * output_loss

However, I noticed that when I train a network with batch_size=32 and then with batch_size=64, the loss value for every epoch comes out more or less the same, with only a ~0.05% difference, and the accuracy of both networks is exactly the same. So essentially, the batch size did not have much effect on the network.

My question is: when I double the batch size, assuming the loss is really being summed, shouldn't the loss be double its previous value, or at least greater? The argument that the network probably learned better with the bigger batch size is negated by the fact that the accuracy stayed exactly the same.

The fact that the loss stays more or less the same regardless of the batch size makes me think it's being averaged.
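The difference between the two hypotheses can be checked with a small sketch. It is pure Python, not Keras code, and the per-sample loss values are made up for illustration; it only shows how a summed batch loss and an averaged batch loss react when the batch size doubles while the per-sample losses stay fixed.

```python
from statistics import fmean

# Hypothetical per-sample losses for 128 training samples (made-up values).
per_sample_losses = [0.5 + 0.01 * (i % 7) for i in range(128)]

def batch_losses(losses, batch_size, reduce):
    """Split `losses` into consecutive batches and reduce each batch to one number."""
    return [reduce(losses[i:i + batch_size])
            for i in range(0, len(losses), batch_size)]

for batch_size in (32, 64):
    summed = batch_losses(per_sample_losses, batch_size, sum)
    averaged = batch_losses(per_sample_losses, batch_size, fmean)
    # The typical summed batch loss doubles with the batch size,
    # while the typical averaged batch loss does not change.
    print(batch_size, fmean(summed), fmean(averaged))
```

If the reported loss were a sum, the first printed number is the one you would expect to see in the training log; a loss that barely moves between batch sizes matches the second.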

Dahabeah answered 4/9, 2018 at 19:30 Comment(6)
The loss is the average, not the sum of the individual losses. - Cason
Can you please confirm this through the code? - Dahabeah
@Cason When I followed the code for fit() it seems to average, but compile() seems to sum. Why is there both? - Dahabeah
See here: github.com/keras-team/keras/blob/master/keras/losses.py All of the losses have K.mean() wrapped around them, showing you that it's the average and not the sum. - Cason
@Cason see comment to the accepted answer. - Dahabeah
My understanding could be flawed. I will have to take a look at a later time as I don't have time right now. - Cason

The code you have posted concerns multi-output models, where each output may have its own loss and weights. Hence, the loss values of the different output layers are summed together. However, the individual losses are averaged over the batch, as you can see in the losses.py file. For example, this is the code related to the binary cross-entropy loss:

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
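As a rough illustration of the summation across outputs described above, the following sketch mirrors the `total_loss += loss_weight * output_loss` line from the question's snippet. The output names, loss values and loss weights are hypothetical, and each output loss is assumed to have already been reduced to a single number for the batch.

```python
# Hypothetical per-output batch losses for a two-output model (made-up values).
output_losses = {"main_output": 0.40, "aux_output": 0.90}
# Hypothetical loss weights, one per output.
loss_weights = {"main_output": 1.0, "aux_output": 0.2}

# The total loss is a weighted sum over the outputs, not over the samples.
total_loss = sum(loss_weights[name] * loss
                 for name, loss in output_losses.items())
print(total_loss)
```

The sum here runs over the model's outputs, so it says nothing about how the samples inside a batch are combined.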

Update: Right after adding the second part of this answer (i.e. the loss functions), I was, like the OP, baffled by the axis=-1 in the definition of the loss function, and I thought to myself that it should be axis=0 to indicate the average over the batch. Then I realized that all the K.mean() calls in the loss function definitions are there for the case of an output layer consisting of multiple units. So where is the loss averaged over the batch? I inspected the code to find the answer: to get the loss value for a specific loss function, a function is called that takes the true and predicted labels, as well as the sample weights and mask, as its inputs:

weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)

What is this weighted_losses[i] function? As you may find, it is an element of a list of (augmented) loss functions:

weighted_losses = [
    weighted_masked_objective(fn) for fn in loss_functions]

fn is actually one of the loss functions defined in the losses.py file, or it may be a user-defined custom loss function. And what is this weighted_masked_objective function? It is defined in the training_utils.py file:

def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.
    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.
    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.
    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """
    if fn is None:
        return None

    def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.
        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.
        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            #  the loss per batch should be proportional
            #  to the number of unmasked samples.
            score_array /= K.mean(mask)

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)
    return weighted

As you can see, first the per sample loss is computed in the line score_array = fn(y_true, y_pred) and then at the end the average of the losses is returned, i.e. return K.mean(score_array). So that confirms that the reported losses are the average of per sample losses in each batch.

Note that K.mean(), when TensorFlow is used as the backend, calls the tf.reduce_mean() function. When K.mean() is called without an axis argument (the default value of axis is None), as it is in the weighted_masked_objective function, the corresponding call to tf.reduce_mean() computes the mean over all the axes and returns one single value. That's why, no matter the shape of the output layer and the loss function used, only one single loss value is used and reported by Keras (and it should be like this, because optimization algorithms need to minimize a scalar value, not a vector or tensor).
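The two reduction stages can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it ignores masks and sample weights, the `eps` clipping is an assumption added to keep the logarithm finite, and the labels and predictions are made up.

```python
import math
from statistics import fmean

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Stage 1: mean over the last axis (the output units), one value per sample."""
    per_sample = []
    for t_row, p_row in zip(y_true, y_pred):
        unit_losses = []
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # assumed clipping, avoids log(0)
            unit_losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
        per_sample.append(fmean(unit_losses))
    return per_sample

def batch_loss(y_true, y_pred):
    """Stage 2: mean over everything that is left, giving a single scalar."""
    return fmean(binary_crossentropy(y_true, y_pred))

# Made-up batch of 2 samples with 3 output units each.
y_true = [[1, 0, 1], [0, 0, 1]]
y_pred = [[0.9, 0.2, 0.7], [0.1, 0.4, 0.6]]

per_sample = binary_crossentropy(y_true, y_pred)
print(per_sample)                  # one loss per sample
print(batch_loss(y_true, y_pred))  # one scalar for the whole batch
```

Because the second stage is a mean, repeating the same samples to double the batch leaves the scalar unchanged.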

Jim answered 4/9, 2018 at 20:52 Comment(9)
Hmm, but this doesn't quite gel with what I noticed in this question: #52035483 - Dahabeah
The reason it doesn't gel is the axis=-1. When the prediction itself is an image, axis=-1 is just a dimension of the image, and it's not really taking the mean over the batch in that case. - Dahabeah
@Dahabeah I was as suspicious as you. See my updated answer. - Jim
Ahhh, that makes a lot more sense. So, to confirm, the axis=-1 is to average the loss for the different classes within a single sample? - Dahabeah
Also, I too thought that the axis=-1 is for when the output layer consists of multiple units. However, this is not the case, I believe. Consider an FCN that is doing multi-label prediction for 10 classes. If the input RGB image is 256x256x3, then the output is 256x256x10 (i.e. 10 segmented images, each one illuminating a different class). - Dahabeah
Now, during training, each of these segmented images is passed into the loss function, which means it is called 10 times per sample, and the dimensions passed into the loss function are 256x256. Therefore, axis=-1 would simply take the average over the rows and produce 256 values; however, Keras produces a SINGLE loss value for THE ENTIRE SAMPLE. Hence the premise of the question I linked in the previous comment. - Dahabeah
So, somehow, binary_crossentropy() from Keras is taking the mean of the entire matrix (256x256 in this example), even though it specifies just a single axis, and then it again takes the mean (or sum? not sure, but once again, the reason for the linked question) of multiple matrices (10 in this example) to produce a single loss per sample, which is then averaged as you proved above. - Dahabeah
@Dahabeah K.mean calls tf.reduce_mean. When K.mean is called without an axis argument (the default value of axis is None), as it is in weighted_masked_objective, tf.reduce_mean computes the mean across all the axes and returns one single value. I have updated my answer to reflect this point. - Jim
Thank you, this does answer this question. But it still leaves me wondering how the losses for the multiple outputs per sample are combined. Take a look at this example: medium.com/nanonets/…. Each of the different labels can be thought of as multiple outputs that are being predicted for each input. Somehow those losses are being combined. It doesn't seem like they are being summed or averaged. - Dahabeah

I would like to summarize the brilliant answers on this page.

  1. Certainly a model needs a scalar value to optimize (i.e. for gradient descent).
  2. This important value is calculated at the batch level. (If you set batch size=1, it is stochastic gradient descent mode, so the gradient is calculated on that single data point.)
  3. In the loss function, a group aggregation function such as K.mean() comes into play for problems such as multi-class classification, where getting the loss of one data point requires combining many scalars across many labels.
  4. In the loss history printed by model.fit, the loss value printed is a running average over the batches. So the value we see is actually an estimated loss per data point, averaged over the batches processed so far.
  5. Be aware that even if we set batch size=1, the printed history may use a different batch interval for printing. In my case:

    self.model.fit(x=np.array(single_day_piece), y=np.array(single_day_reward), batch_size=1)

The printed output is:

 1/24 [>.............................] - ETA: 0s - loss: 4.1276
 5/24 [=====>........................] - ETA: 0s - loss: -2.0592
 9/24 [==========>...................] - ETA: 0s - loss: -2.6107
13/24 [===============>..............] - ETA: 0s - loss: -0.4840
17/24 [====================>.........] - ETA: 0s - loss: -1.8741
21/24 [=========================>....] - ETA: 0s - loss: -2.4558
24/24 [==============================] - 0s 16ms/step - loss: -2.1474

In my problem, there is no way the loss of a single data point could reach a scale of 4.xxx. So I guess the model takes the summed loss of the first 4 data points. However, the batch size for training is not 4.
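The running average mentioned in point 4 can be sketched as follows. This assumes the progress bar shows a sample-weighted mean of the batch losses seen so far in the epoch, which is my reading rather than something confirmed above; the function name and the loss values are made up.

```python
def running_mean(batch_losses, batch_sizes):
    """Yield the sample-weighted mean of the batch losses seen so far."""
    total, seen = 0.0, 0
    for loss, size in zip(batch_losses, batch_sizes):
        total += loss * size  # weight each batch loss by its number of samples
        seen += size
        yield total / seen

# Made-up batch losses with batch_size=1.
losses = [4.0, -2.0, -5.0, 1.0]
displayed = list(running_mean(losses, [1] * len(losses)))
print(displayed)  # [4.0, 1.0, -1.0, -0.5]
```

Under that assumption, the first printed value equals the loss of the first batch alone, and later values drift toward the epoch average.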

Dogma answered 20/1, 2020 at 8:42 Comment(0)
