keras combining two losses with adjustable weights where the outputs do not have the same dimensionality

Asked 10/12, 2018 at 13:56 Answered 17/12, 2018 at 11:56

My question is similar to the one posed here: keras combining two losses with adjustable weights

However, the outputs have a different dimensionality resulting in the outputs not being able to be concatenated. Hence, the solution is not applicable, is there another way to solve this problem?

The question:

I have a keras functional model with two layers with outputs x1 and x2.

x1 = Dense(1,activation='relu')(prev_inp1)

x2 = Dense(2,activation='relu')(prev_inp2)

I need to use these x1 and x2 use them in a weighted loss function like in the attached image. Propagate the 'same loss' into both branches. Alpha is flexible to vary with iterations.

Mastiff answered 10/12, 2018 at 13:56 Comment(11)

Is alpha trainable? – Ramakrishna 10/12, 2018 at 17:55

@DanielMöller Ideally, using callback that should be possible. – Mastiff 10/12, 2018 at 19:5

Yes, but should it be "trainable" or are you adjusting it your own way? – Ramakrishna 10/12, 2018 at 19:19

@DanielMöller, yes trainable by gradient descent + backprop – Mastiff 11/12, 2018 at 7:38

@Mastiff Do you have labels for alpha? Otherwise I don't see how it could be trained - it has no gradient – Splat 11/12, 2018 at 10:1

d_loss/d_alpha can be derived from that loss function. – Mastiff 11/12, 2018 at 12:11

Sure, but I meant d_loss_cls/d_alpha and d_loss_loc/d_alpha. Why would you want to compute d_loss/d_alpha? You could, of course, try to minimize the total loss with respect to alpha, but why would this be good for the models' weights? – Splat 11/12, 2018 at 22:49

@Olivier_s_j, I changed my answer because the previous one was very bad. Alpha would always tend to 0 or 1. Now it tends to balance the values of the two losses. – Ramakrishna 11/11, 2020 at 0:36

@rvinas, when a loss is too big because of the formula (not necessarly because the model is "wronger" for this loss than for the other), the model will be updated too much regarding this loss, not updating the other. For certain models, this will bring a bad result. Training alpha to balance the two losses means the model will take both losses into consideration regardless of their values. This can be better. – Ramakrishna 11/11, 2020 at 0:38

@DanielMöller let's suppose that optimizing loss 1 is much easier than optimizing loss 2. In this case, wouldn't the model push the weight of loss 2 to 0? This would be an easy way to minimize the loss by "training" alpha, but not necessarily the desired behavior. Does it make sense? (wow, it's been a long time since the last comment!) – Splat 11/11, 2020 at 9:14

@rvinas, with the new formula, in the answer I edited yesterday, it's impossible for alpha to go to zero. It will always find a point that makes both losses equal. (Must have a "zero" loss for alpha to go to zero) - But yes, training the easiest loss more than the hardest loss may not be the best option for certain models - On the other hand, take into account that the more you train one loss, the harder it gets, and the more the other loss becomes relatively easier. – Ramakrishna 11/11, 2020 at 13:0

For this question, a more elaborated solution is necessary. Since we're going to use a trainable weight, we will need a custom layer.

Also, we will be needing a different form of training, since our loss doesn't work like the others taking only y_true and y_pred and considers joining two different outputs.

Thus, we're going to create two versions of the same model, one for prediction, another for training, and the training version will contain the loss in itself, using a dummy keras loss function in compilation.

The prediction model

Let's use a very basic example of model with two outputs and one input:

#any input your true model takes
inp = Input((5,5,2))

#represents the localization output
outImg = Conv2D(1,3,activation='sigmoid')(inp)

#represents the classification output
outClass = Flatten()(inp)
outClass = Dense(2,activation='sigmoid')(outClass)

#the model
predictionModel = Model(inp, [outImg,outClass])

You use this one regularly for predictions. It's not necessary to compile this one.

The losses for each branch

Now, let's create custom loss functions for each branch, one for LossCls and another for LossLoc.

Using dummy examples here, you can elaborate these losses better if necessary. The most important is that they output batches shaped like (batch, 1) or (batch,). Both output the same shape so they can be summed later.

def calcImgLoss(x):
    true,pred = x
    loss = binary_crossentropy(true,pred)
    return K.mean(loss, axis=[1,2])

def calcClassLoss(x):
    true,pred = x
    return binary_crossentropy(true,pred)

These will be used in Lambda layers in the training model.

The loss weighting layer - (WARNING! EDITED! - See explanation at the end)

Now, let's weight the losses with the trainable alpha. Trainable parameters need custom layers to be implemented.

class LossWeighter(Layer):
    def __init__(self, **kwargs): #kwargs can have 'name' and other things
        super(LossWeighter, self).__init__(**kwargs)

    #create the trainable weight here, notice the constraint between 0 and 1
    def build(self, inputShape):
        self.weight = self.add_weight(name='loss_weight', 
                                     shape=(1,),
                                     initializer=Constant(0.5), 
                                     constraint=Between(0,1),
                                     trainable=True)
        super(LossWeighter,self).build(inputShape)

    def call(self,inputs):
        #old answer: will always tend to completely ignore the biggest loss
        #return (self.weight * firstLoss) + ((1-self.weight)*secondLoss)
        #problem: alpha tends to 0 or 1, eliminating the biggest of the two losses

        #proposal of working alpha optimization
        #return K.square((self.weight * firstLoss) - ((1-self.weight)*secondLoss))
        #problem: might not train any of the losses, and even increase one of them
                 #in order to minimize the difference between the two losses

        #new answer - a mix between the two, applying gradients to the right weights
        loss1, loss2 = inputs                 #trainable
        static_loss1 = K.stop_gradient(loss1) #non_trainable
        static_loss2 = K.stop_gradient(loss2) #non_trainable
    
        a1 = self.weight                      #trainable
        a2 = 1 - a1                           #trainable
        static_a1 = K.stop_gradient(a1)       #non_trainable
        static_a2 = 1 - static_a1             #non_trainable
    
    
        #this trains only alpha to minimize the difference between both losses  
        alpha_loss = K.square((a1 * static_loss1) - (a2 * static_loss2))
                 #or K.abs   (.....)
             
        #this trains only the original model weights to minimize both original losses
        model_loss = (static_a1 * loss1) + (static_a2 * loss2)
    
        return alpha_loss + model_loss

    def compute_output_shape(self,inputShape):
        return inputShape[0]

Notice that there is a custom constraint to keep this weight between 0 and 1. This constraint is implemented with:

class Between(Constraint):
    def __init__(self,min_value,max_value):
        self.min_value = min_value
        self.max_value = max_value

    def __call__(self,w):
        return K.clip(w,self.min_value, self.max_value)

    def get_config(self):
        return {'min_value': self.min_value,
                'max_value': self.max_value}

The training model

This model will take the prediction model as base, add the loss calculations and loss weighter at the end and output only the loss value. Because it outputs only a loss, we will use the true targets as inputs, and a dummy loss function defined like:

def ignoreLoss(true,pred):
    return pred #this just tries to minimize the prediction without any extra computation

Model inputs:

#true targets
trueImg = Input((3,3,1))
trueClass = Input((2,))

#predictions from the prediction model
predImg = predictionModel.outputs[0]
predClass = predictionModel.outputs[1]

Model outputs = losses:

imageLoss = Lambda(calcImgLoss, name='loss_loc')([trueImg, predImg])
classLoss = Lambda(calcClassLoss, name='loss_cls')([trueClass, predClass])
weightedLoss = LossWeighter(name='weighted_loss')([imageLoss,classLoss])

Model:

trainingModel = Model([predictionModel.input, trueImg, trueClass], weightedLoss)
trainingModel.compile(optimizer='sgd', loss=ignoreLoss)

Dummy training

inputImages = np.zeros((7,5,5,2))
outputImages = np.ones((7,3,3,1))
outputClasses = np.ones((7,2))
dummyOut = np.zeros((7,))

trainingModel.fit([inputImages,outputImages,outputClasses], dummyOut, epochs = 50)
predictionModel.predict(inputImages)

Necessary imports

from keras.layers import *
from keras.models import Model
from keras.constraints import Constraint
from keras.initializers import Constant
from keras.losses import binary_crossentropy #or another you need

(EDIT) Explaining the problem with the old answer:

The formula used in the old answer would make alpha always go to 0 or 1, meaning only the smallest of the two losses would be ever trained. (Useless)

A new formula leads alpha to make both losses have the same value. Alpha would be trained properly and not tend to 0 or 1. But, still, the losses would not be properly trained because "increasing one loss to reach the other" would be a possibility for the model, and once both losses were equal, the model would stop training.

The new solution is a mix of the two proposals above, while the first actually trains the losses but with wrong alpha; and the second trains alpha with wrong losses. The mixed solution adds both, but uses K.stop_gradient to prevent the wrong part of the training from happening.

The result of this will be: the "easiest" loss (not the biggest) will be more trained than the hardest. We may use K.abs or K.square, as compared to "mae" or "mse" between the two losses. The best option is up to experiment.

See this table comparing the old and new proposals:

This does not guarantee the best optimization though!!!

Training the easiest loss will not always have the best result, though. It may be better than favoring a huge loss just because it's formula is different. But the expected result might still need some manual weighting of the losses.

I fear there is no automatic training for this weight. If you have a target metric, you can try to train this metric (when possible, but metrics that depend on sorting, getting an index, rounding or anything that breaks backpropagation may not be possible to be transformed in losses).

Gratifying answered 17/12, 2018 at 11:56 Comment(0)

There is no need to concatenate your outputs. To pass multiple arguments to a loss function, you can wrap it as follows:

def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        return (1-alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss

And then compile your functional model as:

x1 = Dense(1, activation='relu')(prev_inp1)
x2 = Dense(2, activation='relu')(prev_inp2)
y1 = Input((1,))
y2 = Input((2,))

model.compile('sgd',
              loss=custom_loss(x1, x2, y1, y2, 0.5),
              target_tensors=[y1, y2])

NOTE: Not tested.

Splat answered 11/12, 2018 at 9:57 Comment(2)

In keras the loss is applied for each output, so y_true will not contain the labels for the respective ouputs (i.e. y_true[0] and y_true[1]). Or am I missing something? – Mastiff 11/12, 2018 at 12:32

You're right, I have updated my answer. In this case, you could also pass the labels as arguments to custom_loss. This will require using the argument target_tensors of model.compile. – Splat 11/12, 2018 at 22:56