How do I mask a loss function in Keras with the TensorFlow backend?

I am trying to implement a sequence-to-sequence task using an LSTM in Keras with the TensorFlow backend. The inputs are English sentences of variable length. To construct a dataset with the 2-D shape [batch_number, max_sentence_length], I append an EOF marker to each sentence and pad it with enough placeholder characters, e.g. #. Each character in the sentence is then transformed into a one-hot vector, so the dataset has the 3-D shape [batch_number, max_sentence_length, character_number]. After the LSTM encoder and decoder layers, the softmax cross-entropy between the output and the target is computed.
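
For concreteness, a minimal sketch of this preprocessing (the character mapping and toy sentences are only illustrative and the EOF handling is omitted; the model input X below instead uses all-zero vectors for the padding so that Masking(mask_value=0) can pick them up):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

char_to_id = {'a': 0, 'b': 1, '#': 2}                          # '#' is the padding placeholder
sentences = ['abb', 'babb']                                    # toy "sentences"
ids = [[char_to_id[c] for c in s] for s in sentences]
padded = pad_sequences(ids, maxlen=5, value=char_to_id['#'])   # pads at the front by default
one_hot = np.eye(len(char_to_id))[padded]                      # shape (batch, max_len, chars)
print(one_hot.shape)  # (2, 5, 3), matching the y_true array below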

To eliminate the effect of the padding on model training, masking can be applied to both the input and the loss function. Masking the input in Keras can be done with layers.core.Masking. In TensorFlow, masking the loss function can be done as described here: custom masked loss function in TensorFlow.

However, I can't find a way to do this in Keras, since a user-defined loss function in Keras only accepts the parameters y_true and y_pred. So how can I pass the true sequence_lengths to the loss function and apply the mask?

Besides, I found a function _weighted_masked_objective(fn) in keras/engine/training.py. Its docstring is

Adds support for masking and sample-weighting to an objective function.

But it seems that the function can only accept fn(y_true, y_pred). Is there a way to use this function to solve my problem?

To be specific, I modified Yu-Yang's example.

from keras.models import Model
from keras.layers import Input, Masking, LSTM, Dense, RepeatVector, TimeDistributed, Activation
import numpy as np
from numpy.random import seed as random_seed
random_seed(123)

max_sentence_length = 5
character_number = 3 # valid characters 'a', 'b' and the placeholder '#'

input_tensor = Input(shape=(max_sentence_length, character_number))
masked_input = Masking(mask_value=0)(input_tensor)
encoder_output = LSTM(10, return_sequences=False)(masked_input)
repeat_output = RepeatVector(max_sentence_length)(encoder_output)
decoder_output = LSTM(10, return_sequences=True)(repeat_output)
output = Dense(3, activation='softmax')(decoder_output)

model = Model(input_tensor, output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

X = np.array([[[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]],
              [[0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]])
y_true = np.array([[[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0]],  # the batch is ['##abb','#babb'], padding '#'
                   [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]])

y_pred = model.predict(X)
print('y_pred:', y_pred)
print('y_true:', y_true)
print('model.evaluate:', model.evaluate(X, y_true))
# See if the loss computed by model.evaluate() is equal to the masked loss
import tensorflow as tf
logits=tf.constant(y_pred, dtype=tf.float32)
target=tf.constant(y_true, dtype=tf.float32)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(target * tf.log(logits),axis=2))
losses = -tf.reduce_sum(target * tf.log(logits),axis=2)
sequence_lengths=tf.constant([3,4])
mask = tf.reverse(tf.sequence_mask(sequence_lengths,maxlen=max_sentence_length),[0,1])
losses = tf.boolean_mask(losses, mask)
masked_loss = tf.reduce_mean(losses)
with tf.Session() as sess:
    c_e = sess.run(cross_entropy)
    m_c_e=sess.run(masked_loss)
    print("tf unmasked_loss:", c_e)
    print("tf masked_loss:", m_c_e)

The outputs of Keras and TensorFlow are compared as follows:

(Screenshot: the loss from model.evaluate() matches the unmasked TensorFlow loss rather than the masked one.)

As shown above, the mask is dropped after certain kinds of layers. So how can the loss function be masked in Keras when those layers are added?

Ftlb answered 1/11, 2017 at 14:37 Comment(2)
Do you want dynamic masking? – Hydrograph
@MarcinMożejko If "dynamic masking" means masking the loss function according to the model's different input data, then yes, this is what I want. – Ftlb

If there's a mask in your model, it'll be propagated layer by layer and eventually applied to the loss. So if you're padding and masking the sequences correctly, the loss on the padding placeholders will be ignored.

Some Details:

It's a bit involved to explain the whole process, so I'll just break it down to several steps:

  1. In compile(), the mask is collected by calling compute_mask() and applied to the loss(es) (irrelevant lines are ignored for clarity).
weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]

# Prepare output masks.
masks = self.compute_mask(self.inputs, mask=None)
if masks is None:
    masks = [None for _ in self.outputs]
if not isinstance(masks, list):
    masks = [masks]

# Compute total loss.
total_loss = None
with K.name_scope('loss'):
    for i in range(len(self.outputs)):
        y_true = self.targets[i]
        y_pred = self.outputs[i]
        weighted_loss = weighted_losses[i]
        sample_weight = sample_weights[i]
        mask = masks[i]
        with K.name_scope(self.output_names[i] + '_loss'):
            output_loss = weighted_loss(y_true, y_pred,
                                        sample_weight, mask)
  2. Inside Model.compute_mask(), run_internal_graph() is called.
  3. Inside run_internal_graph(), the masks in the model are propagated layer by layer from the model's inputs to its outputs by calling Layer.compute_mask() on each layer iteratively.

So if you're using a Masking layer in your model, you shouldn't worry about the loss on the padding placeholders. The loss on those entries will be masked out as you've probably already seen inside _weighted_masked_objective().
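
For reference, the mask handling inside _weighted_masked_objective() looks roughly like this (paraphrased from the Keras 2.x source, where K is keras.backend; sample weighting is omitted):

def _weighted_masked_objective(fn):
    def weighted(y_true, y_pred, weights, mask=None):
        # per-timestep loss
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # zero out the loss on masked timesteps and renormalize,
            # so masked entries do not contribute to the average
            mask = K.cast(mask, K.floatx())
            score_array *= mask
            score_array /= K.mean(mask)
        # (sample weighting omitted)
        return K.mean(score_array)
    return weighted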

A Small Example:

from keras.models import Model
from keras.layers import Input, Masking, LSTM
import numpy as np

max_sentence_length = 5
character_number = 2

input_tensor = Input(shape=(max_sentence_length, character_number))
masked_input = Masking(mask_value=0)(input_tensor)
output = LSTM(3, return_sequences=True)(masked_input)
model = Model(input_tensor, output)
model.compile(loss='mae', optimizer='adam')

X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
              [[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])
y_true = np.ones((2, max_sentence_length, 3))
y_pred = model.predict(X)
print(y_pred)
[[[ 0.          0.          0.        ]
  [ 0.          0.          0.        ]
  [-0.11980877  0.05803877  0.07880752]
  [-0.00429189  0.13382857  0.19167568]
  [ 0.06817091  0.19093043  0.26219055]]

 [[ 0.          0.          0.        ]
  [ 0.0651961   0.10283815  0.12413475]
  [-0.04420842  0.137494    0.13727818]
  [ 0.04479844  0.17440712  0.24715884]
  [ 0.11117355  0.21645413  0.30220413]]]

# See if the loss computed by model.evaluate() is equal to the masked loss
unmasked_loss = np.abs(1 - y_pred).mean()
masked_loss = np.abs(1 - y_pred[y_pred != 0]).mean()

print(model.evaluate(X, y_true))
0.881977558136

print(masked_loss)
0.881978

print(unmasked_loss)
0.917384

As can be seen from this example, the loss on the masked part (the zeroes in y_pred) is ignored, and the output of model.evaluate() is equal to masked_loss.


EDIT:

If there's a recurrent layer with return_sequences=False, the mask stops propagating (i.e., the returned mask is None). In RNN.compute_mask():

def compute_mask(self, inputs, mask):
    if isinstance(mask, list):
        mask = mask[0]
    output_mask = mask if self.return_sequences else None
    if self.return_state:
        state_mask = [None for _ in self.states]
        return [output_mask] + state_mask
    else:
        return output_mask
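
You can see where the mask stops in a given model by calling compute_mask() directly, which is the same call compile() makes in step 1 above (a quick check for the Keras 2.x functional API):

# For the encoder-decoder model in the question this prints None,
# i.e. no mask reaches the output, so the loss is not masked.
masks = model.compute_mask(model.inputs, mask=None)
print(masks)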

In your case, if I understand correctly, you want a mask that's based on y_true, and whenever the value of y_true is [0, 0, 1] (the one-hot encoding of "#") you want the loss to be masked. If so, you need to mask the loss values in a somewhat similar way to Daniel's answer.

The main difference is the final average. The average should be taken over the number of unmasked values, which is just K.sum(mask). Also, y_true can be compared to the one-hot encoded vector [0, 0, 1] directly.

import keras.backend as K

def get_loss(mask_value):
    mask_value = K.variable(mask_value)
    def masked_categorical_crossentropy(y_true, y_pred):
        # find which timesteps in `y_true` are the padding character '#'
        mask = K.all(K.equal(y_true, mask_value), axis=-1)
        # invert: padding timesteps become 0, real timesteps become 1
        mask = 1 - K.cast(mask, K.floatx())

        # multiply categorical_crossentropy with the mask
        loss = K.categorical_crossentropy(y_true, y_pred) * mask

        # take average w.r.t. the number of unmasked entries
        return K.sum(loss) / K.sum(mask)
    return masked_categorical_crossentropy

masked_categorical_crossentropy = get_loss(np.array([0, 0, 1]))
model = Model(input_tensor, output)
model.compile(loss=masked_categorical_crossentropy, optimizer='adam')

The output of the above code then shows that the loss is computed only on the unmasked values:

model.evaluate: 1.08339476585
tf unmasked_loss: 1.08989
tf masked_loss: 1.08339

The value is different from yours because I've changed the axis argument in tf.reverse from [0,1] to [1].
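
For reference, the corrected part of the TF computation from the question then reads (only the mask line changes; reversing along axis 1 flips the mask in time so that it lines up with the padding at the front of each sequence, while axis 0 would also swap the two samples):

sequence_lengths = tf.constant([3, 4])
# reverse along the time axis only, so the True entries cover the non-padded tail
mask = tf.reverse(tf.sequence_mask(sequence_lengths, maxlen=max_sentence_length), [1])
losses = tf.boolean_mask(losses, mask)
masked_loss = tf.reduce_mean(losses)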

Steato answered 1/11, 2017 at 17:47 Comment(16)
Thanks for the reply. Yes, this works when return_sequences=True in the LSTM. However, in an encoder-decoder model the encoder LSTM generally sets return_sequences=False and uses RepeatVector to repeat the output of its last unit, which the decoder LSTM then accepts. To be specific, I modified your small example to show the problem. I will show it by 'answering my question' below, since a comment can't be too long. – Ftlb
@Ftlb Ah, by seq2seq I thought you meant models like the one in this example. I've updated the answer. Please see if that's what you want. – Steato
Firstly, many thanks. Yes, I want a mask that's based on y_true. I ran your updated code, and it raises an error: "ValueError: Dimensions must be equal, but are 5 and 3 for 'Equal' (op: 'Equal') with input shapes: [2,5,3], [3,1]." Is this caused by different versions or something else? – Ftlb
My bad, I've pasted the wrong code. It should work now. – Steato
There is still an error: "ValueError: initial_value must have a shape specified: Tensor("dense_1_target:0", shape=(?, ?, ?), dtype=float32)". Maybe I made some mistake? – Ftlb
Umm, what are your Keras and TF versions? I tested the code on Keras 2.0.9 + TF 1.3.0 and it works fine. Can you provide more information about the error? – Steato
My Keras is 2.0.4 and TF is 1.1.0. I just pushed my code and an error screenshot to "github.com/Shuaaai/Mask-loss-function". Could you check it? Thanks. – Ftlb
(1) There's a K.variable(y_true) in your loss function; please remove the K.variable(). (2) For Keras 2.0.4, it should be K.categorical_crossentropy(y_pred, y_true) instead of K.categorical_crossentropy(y_true, y_pred); the arguments are swapped in later versions. (3) The line mask = tf.reverse(...,[0,1]) should be mask = tf.reverse(...,[1]). You don't want to swap the samples (axis 0), right? – Steato
Yes, I made so many mistakes... Thank you very much! My problem is solved. – Ftlb
Hi Yu-Yang, I have tried this masking method on the model "github.com/fchollet/keras/blob/master/examples/addition_rnn.py". I don't see a clear difference in training or validation loss between the original model and the modified model for the same number of iterations. Do you have any experience with this? Thank you. – Ftlb
Sorry, I don't really have experience working with this model. At first glance, I think you probably shouldn't mask out the "padding spaces" in that problem. The model should learn to predict spaces if the answer contains padding spaces. Consider the example "12 + 34 = 46" versus "12 + 34 = 468"; the latter is clearly wrong. The model should output 4, 6, and a padding space given the input "12+34". – Steato
To put it differently, if the model predicts quite well, the padding positions should not incur much loss. Then it's not so important whether you masked out the padding spaces or not. – Steato
Yes, "12 + 34 = 468" would be a wrong prediction given "12 + 34 = 46". So I add an EOF to the answers (e.g., "46# ") and the model learns the EOF as well. Then the characters before the EOF are output and the padding spaces after the EOF can be ignored. I guess the padding positions affect the model less if the model is strong enough for the task, but I haven't found a relevant study or theoretical explanation. – Ftlb
Thanks for your clear answer. I have one more question: Dense layers seem not to support masking, so what happens if TimeDistributed(Dense()) is appended after an LSTM(return_sequences=True)? Will the mask become invalid? – Jareb
I believe Dense layers do support masking. Since there's no compute_mask function implemented in Dense, by default the mask should just propagate through the layer without modification. – Steato
@Steato What would be the equivalent of this in TensorFlow? – Boccioni

If you're not using masks as in Yu-Yang's answer, you can try this.

If your target data Y contains variable-length sequences padded with the mask value, you can:

import keras.backend as K
def custom_loss(yTrue,yPred):

    #find which values in yTrue (target) are the mask value
    isMask = K.equal(yTrue, maskValue) #true for all mask values

    #since y is shaped as (batch, length, features), we need all features to be mask values
    isMask = K.all(isMask, axis=-1) #the entire output vector must be true
        #this second line is only necessary if the output features are more than 1

    #transform to float (0 or 1) and invert
    isMask = K.cast(isMask, dtype=K.floatx())
    isMask = 1 - isMask #now mask values are zero, and others are 1

    #multiply this by the inputs:
    #maybe you might need K.expand_dims(isMask) to add the extra dimension removed by K.all
    yTrue = yTrue * isMask
    yPred = yPred * isMask

    return someLossFunction(yTrue,yPred)

If you only pad the input data, or if Y has no length dimension, you can build your own mask outside the function:

masks = [
   [1,1,1,1,1,1,0,0,0],
   [1,1,1,1,0,0,0,0,0],
   [1,1,1,1,1,1,1,1,0]
]
 #shape (samples, length). If it fails, make it (samples, length, 1). 

import keras.backend as K

masks = K.constant(masks)

Since masks depend on your input data, you can use your mask value to know where to put zeros, such as:

import numpy as np

masks = np.array((X_train == maskValue).all(axis=-1), dtype='float64')  # reduce over the feature axis
masks = 1 - masks

#here too, if you have a problem with dimensions in the multiplications below
#expand masks dimensions by adding a last dimension = 1.

And make your function take the masks from outside (you must recreate the loss function if you change the input data):

def customLoss(yTrue,yPred):

    yTrue = masks*yTrue
    yPred = masks*yPred

    return someLossFunction(yTrue,yPred)
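
A minimal usage sketch (assuming plain mean squared error stands in for the someLossFunction placeholder and that a Keras model named model already exists); since the closure captures masks, the model must be recompiled whenever the mask array is rebuilt for new data:

import keras.backend as K

def someLossFunction(yTrue, yPred):
    return K.mean(K.square(yPred - yTrue))

model.compile(loss=customLoss, optimizer='adam')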

Does anyone know if Keras automatically masks the loss function? Since it provides a Masking layer and says nothing about the outputs, maybe it does it automatically?

Contraoctave answered 1/11, 2017 at 15:47 Comment(6)
Daniel, this is a really poor answer. Length masks are dynamically assigned to y_true and y_pred, so you cannot define them outside; such masks keep changing. If you do it the way you provided, you end up with a constant mask, which is not what the OP expects. – Hydrograph
@MarcinMożejko, thank you very much. My answer was indeed a bad answer. – Cautious
Still not good compared to Yu-Yang's, but in case they don't use a masking layer, it may apply. – Cautious
If you define the custom loss inside your model-building function, you can still access the mask tensor, so this answer is valid. – Thumbprint
@DanielMöller In your customLoss snippet: if the mask sets some yTrue and yPred values to zero, doesn't that mean that yTrue = yPred there and the loss artificially goes up? – Janes
@Helen, the loss artificially goes down, but the intended effect is for the loss to be "constant" for those values, which means they never influence training. But adding some weight to the other elements to compensate for the zeros might be a good idea to balance things. – Cautious

I took both answers and improvised a way to handle multiple timesteps and single missing target values in the loss for an LSTM (or other recurrent NN) with return_sequences=True.

Daniel's answer would not suffice for multiple targets, due to isMask = K.all(isMask, axis=-1). Removing this aggregation probably made the function non-differentiable. I don't know for sure, since I never ran the pure function and can't tell whether it is able to fit a model.

I fused Yu-Yang's and Daniel's answers together and it worked.


from tensorflow.keras.layers import Layer, Input, LSTM, Dense, TimeDistributed
from tensorflow.keras import Model, Sequential
import tensorflow.keras.backend as K
import numpy as np


mask_Value = -2
def get_loss(mask_value):
    mask_value = K.variable(mask_value)
    def masked_loss(yTrue,yPred):
        
        #find which values in yTrue (target) are the mask value
        isMask = K.equal(yTrue, mask_value) #true for all mask values (uses the closure argument)

        #transform to float (0 or 1) and invert
        isMask = K.cast(isMask, dtype=K.floatx())
        isMask = 1 - isMask #now mask values are zero, and others are 1
        
        #multiply this by the inputs:
        #maybe you might need K.expand_dims(isMask) to add the extra dimension removed by K.all
        yTrue = yTrue * isMask   
        yPred = yPred * isMask
        
        # perform a root mean square error, whereas the mean is in respect to the mask
        mean_loss = K.sum(K.square(yPred - yTrue))/K.sum(isMask)
        loss = K.sqrt(mean_loss)
    
        return loss
        #RootMeanSquaredError()(yTrue,yPred)
        
    return masked_loss

# define timeseries data
n_sample = 10
timesteps = 5
feat_inp = 2
feat_out = 2

X = np.random.uniform(0,1, (n_sample, timesteps, feat_inp))
y = np.random.uniform(0,1, (n_sample,timesteps, feat_out))

# define model
model = Sequential()
model.add(LSTM(50, activation='relu',return_sequences=True, input_shape=(timesteps, feat_inp)))
model.add(Dense(feat_out))
model.compile(optimizer='adam', loss=get_loss(mask_Value))
model.summary()

# %%
model.fit(X, y, epochs=50, verbose=0)

Donnell answered 26/11, 2021 at 15:14 Comment(0)

Note that Yu-Yang's answer does not appear to work on TensorFlow Keras 2.7.0.

Surprisingly, model.evaluate does not compute masked_loss or unmasked_loss. Instead, it assumes that the loss from all masked input steps is zero (but still includes those steps in the mean() calculation). This means that every masked timestep actually reduces the calculated error!

#%% Yu-yang's example
# https://mcmap.net/q/402689/-how-do-i-mask-a-loss-function-in-keras-with-the-tensorflow-backend
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
# Fix the random seed for repeatable results
np.random.seed(5)
tf.random.set_seed(5)

max_sentence_length = 5
character_number = 2

input_tensor = keras.Input(shape=(max_sentence_length, character_number))
masked_input = keras.layers.Masking(mask_value=0)(input_tensor)
output = keras.layers.LSTM(3, return_sequences=True)(masked_input)
model = keras.Model(input_tensor, output)
model.compile(loss='mae', optimizer='adam')

X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
              [[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])
y_true = np.ones((2, max_sentence_length, 3))
y_pred = model.predict(X)
print(y_pred)

# See if the loss computed by model.evaluate() is equal to the masked loss
unmasked_loss = np.abs(1 - y_pred).mean()
masked_loss = np.abs(1 - y_pred[y_pred != 0]).mean()

print(f"model.evaluate= {model.evaluate(X, y_true)}")
print(f"masked loss= {masked_loss}")
print(f"unmasked loss= {unmasked_loss}") 

Prints:

[[[ 0.          0.          0.        ]
  [ 0.          0.          0.        ]
  [ 0.05340272 -0.06415359 -0.11803789]
  [ 0.08775083  0.00600774 -0.10454659]
  [ 0.11212641  0.07632366 -0.04133942]]

 [[ 0.          0.          0.        ]
  [ 0.05394626  0.08956442  0.03843312]
  [ 0.09092357 -0.02743799 -0.10386454]
  [ 0.10791279  0.04083341 -0.08820333]
  [ 0.12459432  0.09971555 -0.02882453]]]
1/1 [==============================] - 1s 658ms/step - loss: 0.6865
model.evaluate= 0.6864957213401794
masked loss= 0.9807082414627075
unmasked loss= 0.986495852470398 
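
For what it's worth, the printed numbers are consistent with that explanation: 7 of the 10 timesteps are unmasked, and scaling the masked loss by 7/10 reproduces what model.evaluate() reports (a quick sanity check, not part of the original experiment):

masked_loss = 0.9807082414627075
print(masked_loss * 7 / 10)  # ~0.68650, matching model.evaluate()'s 0.6864957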

(This is intended as a comment rather than an answer).

Cinema answered 19/10, 2022 at 6:42 Comment(0)
