How does mask_zero in Keras Embedding layer work?

Asked 25/11, 2017 at 11:3 Answered 8/4, 2020 at 13:50

Solved python machine-learning keras word-embedding

I thought mask_zero=True would output 0's when the input value is 0, so that the following layers could skip computation or something.

How does mask_zero work?

Example:

import numpy as np
from keras.layers import Input, Embedding
from keras.models import Model

data_in = np.array([
  [1, 2, 0, 0]
])
print(data_in.shape)  # (1, 4)

# model
x = Input(shape=(4,))
e = Embedding(5, 5, mask_zero=True)(x)

m = Model(inputs=x, outputs=e)
p = m.predict(data_in)
print(p.shape)
print(p)

The actual output is (the numbers are random):

(1, 4, 5)
[[[ 0.02499047  0.04617121  0.01586803  0.0338897   0.009652  ]
  [ 0.04782704 -0.04035913 -0.0341589   0.03020919 -0.01157228]
  [ 0.00451764 -0.01433611  0.02606953  0.00328832  0.02650392]
  [ 0.00451764 -0.01433611  0.02606953  0.00328832  0.02650392]]]

However, I expected the output to be:

[[[ 0.02499047  0.04617121  0.01586803  0.0338897   0.009652  ]
  [ 0.04782704 -0.04035913 -0.0341589   0.03020919 -0.01157228]
  [ 0 0 0 0 0]
  [ 0 0 0 0 0]]]

Daffy answered 25/11, 2017 at 11:3 Comment(4)

They're repeating the outputs of the last calculated steps. The documentation assures you that it's not "computing" them anymore. And since they're all the same for all the remaining steps, it's probably just a dummy repetition just to fill the shape of a numpy array. – Rimola 25/11, 2017 at 13:15

Interested to know why these are non-zero. How are they computed? – Obedient 24/11, 2018 at 16:10

There is an excellent write-up of how mask_zero works and what the propagated effects are in the tf.keras documentation tensorflow.org/guide/keras/masking_and_padding – Soemba 26/12, 2019 at 5:36

Ya except it appears that this only works when 'masking is supported' which is not the case in CNN layers... – Haydon 12/1, 2022 at 16:11

Actually, setting mask_zero=True on the Embedding layer does not make it return a zero vector. The layer's own output is unchanged: for an input of 0 it still returns the embedding vector stored at index zero. You can confirm this by inspecting the Embedding layer's weights (in the example you mentioned that would be m.layers[1].get_weights(), since m.layers[0] is the input layer). What mask_zero=True does change is the behavior of the following layers, such as RNN layers.
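To make the lookup behavior concrete, here is a minimal NumPy sketch. It is not Keras code: the table array is a hypothetical stand-in for the Embedding weight matrix, filled with small random values, and the lookup is written as plain row indexing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the Embedding weights: 5 indices, 5 dimensions.
table = rng.uniform(-0.05, 0.05, size=(5, 5))

data_in = np.array([[1, 2, 0, 0]])

# An embedding lookup is row indexing; index 0 is treated like any other index.
embedded = table[data_in]

print(embedded.shape)                                  # (1, 4, 5)
print(np.array_equal(embedded[0, 2], table[0]))        # True: row 0, not zeros
print(np.array_equal(embedded[0, 2], embedded[0, 3]))  # True: same row twice
```

This matches what the question shows: the last two rows of the output are identical and non-zero because both come from row 0 of the weight matrix.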

If you inspect the source code of the Embedding layer, you will see a method called compute_mask:

def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
        return None
    output_mask = K.not_equal(inputs, 0)
    return output_mask
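In NumPy terms the mask is just an element-wise comparison with 0. The sketch below uses np.not_equal as an assumed stand-in for the backend call K.not_equal and applies it to the two inputs used in this thread. The standalone compute_mask function here is illustrative, not the Keras method itself.

```python
import numpy as np

def compute_mask(inputs, mask_zero=True):
    """NumPy stand-in for Embedding.compute_mask (illustrative only)."""
    if not mask_zero:
        return None
    # True where the token is real, False where it is the padding index 0.
    return np.not_equal(inputs, 0)

print(compute_mask(np.array([[1, 2, 0, 0]])))  # [[ True  True False False]]
print(compute_mask(np.array([[1, 0, 2, 0]])))  # [[ True False  True False]]
print(compute_mask(np.array([[1, 2, 0, 0]]), mask_zero=False))  # None
```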

This output mask is passed, as the mask argument, to the following layers that support masking. This is implemented in the __call__ method of the base layer class, Layer:

# Handle mask propagation.
previous_mask = _collect_previous_mask(inputs)
user_kwargs = copy.copy(kwargs)
if not is_all_none(previous_mask):
    # The previous layer generated a mask.
    if has_arg(self.call, 'mask'):
        if 'mask' not in kwargs:
            # If mask is explicitly passed to __call__,
            # we should override the default mask.
            kwargs['mask'] = previous_mask
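The propagation idea can be sketched in plain Python. Everything below is hypothetical (ToyEmbedding, ToyMaskedSum and run are made-up names, not Keras classes). It only illustrates the pattern from the excerpt: the mask produced by one layer is handed to the next layer's call if, and only if, that call accepts a mask argument.

```python
import inspect

class ToyEmbedding:
    """Hypothetical layer: emits a mask, but its call() takes no mask."""
    table = {0: 9.0, 1: 1.0, 2: 2.0}  # index 0 maps to a non-zero value

    def call(self, inputs):
        return [self.table[i] for i in inputs]

    def compute_mask(self, inputs, mask=None):
        return [i != 0 for i in inputs]

class ToyMaskedSum:
    """Hypothetical layer whose call() accepts a mask and skips masked steps."""
    def call(self, inputs, mask=None):
        if mask is None:
            return sum(inputs)
        return sum(x for x, keep in zip(inputs, mask) if keep)

    def compute_mask(self, inputs, mask=None):
        return None  # the time axis is reduced away, so no mask is emitted

def run(layers, inputs):
    """Chain layers, handing each one the mask produced by its predecessor."""
    x, mask = inputs, None
    for layer in layers:
        kwargs = {}
        # Same idea as has_arg(self.call, 'mask') in the excerpt above.
        if mask is not None and 'mask' in inspect.signature(layer.call).parameters:
            kwargs['mask'] = mask
        next_mask = layer.compute_mask(x, mask)
        x = layer.call(x, **kwargs)
        mask = next_mask
    return x

print(run([ToyEmbedding(), ToyMaskedSum()], [1, 0, 2, 0]))     # 3.0
print(ToyMaskedSum().call(ToyEmbedding().call([1, 0, 2, 0])))  # 21.0
```

With the mask propagated, the two padding positions contribute nothing to the sum. Without it, their (non-zero) index-0 values are added in.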

Passing the mask this way makes the following layers ignore the masked input steps (i.e. those steps are not considered in their computations). Here is a minimal example:

import numpy as np
from keras.layers import Input, Embedding, LSTM
from keras.models import Model

data_in = np.array([
  [1, 0, 2, 0]
])

x = Input(shape=(4,))
e = Embedding(5, 5, mask_zero=True)(x)
rnn = LSTM(3, return_sequences=True)(e)

m = Model(inputs=x, outputs=rnn)
m.predict(data_in)

array([[[-0.00084503, -0.00413611,  0.00049972],
        [-0.00084503, -0.00413611,  0.00049972],
        [-0.00144554, -0.00115775, -0.00293898],
        [-0.00144554, -0.00115775, -0.00293898]]], dtype=float32)

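As you can see, the output of the LSTM layer at the second timestep is identical to its output at the first timestep, and the output at the fourth timestep is identical to the output at the third. This means that the second and fourth timesteps have been masked: the layer skipped them and repeated its previous output.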
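The repeated rows can be reproduced with a small NumPy sketch. This is a plain tanh recurrence, not an LSTM, and the weights W, U and the table array are hypothetical random values. The only point is the masking rule, which I believe matches what Keras does for masked steps: the state is left untouched, so the previous output is emitted again.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # hypothetical input-to-hidden weights
U = rng.normal(size=(3, 3))   # hypothetical hidden-to-hidden weights

def masked_rnn(inputs, mask):
    """Run a tanh RNN over (timesteps, features), skipping masked steps."""
    state = np.zeros(3)
    outputs = []
    for x_t, keep in zip(inputs, mask):
        if keep:
            state = np.tanh(x_t @ W + state @ U)
        # When keep is False the state is left untouched, so the output
        # for this step is a copy of the previous step's output.
        outputs.append(state.copy())
    return np.stack(outputs)

table = rng.normal(size=(5, 5))        # hypothetical embedding table
tokens = np.array([1, 0, 2, 0])
out = masked_rnn(table[tokens], tokens != 0)

print(np.array_equal(out[1], out[0]))  # True: step 2 repeats step 1
print(np.array_equal(out[3], out[2]))  # True: step 4 repeats step 3
```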

Update: The mask is also taken into account when computing the loss, since the loss functions are internally augmented to support masking using weighted_masked_objective:

def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.
    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.
    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.
    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """

This wrapper is applied to each loss function when the model is compiled:

weighted_losses = [weighted_masked_objective(fn) for fn in loss_functions]

You can verify this using the following example:

import numpy as np
from keras.layers import Input, Embedding, Dense
from keras.models import Model

data_in = np.array([[1, 2, 0, 0]])
data_out = np.arange(12).reshape(1,4,3)

x = Input(shape=(4,))
e = Embedding(5, 5, mask_zero=True)(x)
d = Dense(3)(e)

m = Model(inputs=x, outputs=d)
m.compile(loss='mse', optimizer='adam')
preds = m.predict(data_in)
loss = m.evaluate(data_in, data_out, verbose=0)
print(preds)
print('Computed Loss:', loss)

[[[ 0.009682    0.02505393 -0.00632722]
  [ 0.01756451  0.05928303  0.0153951 ]
  [-0.00146054 -0.02064196 -0.04356086]
  [-0.00146054 -0.02064196 -0.04356086]]]
Computed Loss: 9.041069030761719

# verify that only the first two outputs 
# have been considered in the computation of loss
print(np.square(preds[0,0:2] - data_out[0,0:2]).mean())

9.041070036475277
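For reference, here is a NumPy sketch of the masking arithmetic. As far as I can tell from the Keras 2.x source, the wrapper multiplies the per-timestep loss by the mask and divides by the mean of the mask before averaging, so treat the details as an assumption. The function name masked_mse is mine, and the predictions are copied from the output above rather than produced by a model.

```python
import numpy as np

def masked_mse(y_true, y_pred, mask):
    """Mean squared error averaged over unmasked timesteps only."""
    per_step = np.mean(np.square(y_pred - y_true), axis=-1)  # (batch, time)
    mask = mask.astype(per_step.dtype)
    per_step = per_step * mask         # zero out masked timesteps
    per_step = per_step / mask.mean()  # rescale by the fraction kept
    return per_step.mean()

data_in = np.array([[1, 2, 0, 0]])
data_out = np.arange(12).reshape(1, 4, 3)
preds = np.array([[[ 0.009682,    0.02505393, -0.00632722],
                   [ 0.01756451,  0.05928303,  0.0153951 ],
                   [-0.00146054, -0.02064196, -0.04356086],
                   [-0.00146054, -0.02064196, -0.04356086]]])

loss = masked_mse(data_out, preds, data_in != 0)
reference = np.square(preds[0, 0:2] - data_out[0, 0:2]).mean()
print(np.isclose(loss, reference))  # True
```

Dividing by the mean of the mask is what turns "average over all four timesteps" into "average over the two unmasked timesteps".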

Propane answered 25/11, 2018 at 18:11 Comment(7)

Thank you, so what happens at the evaluation of the model. Does it mean we need to shift our output vector by 1? binary classification i.e. when y in 0,1. Or assuming the loss is computed with the mask, how do we actually evaluate such generator at the end? When we run predictions, we still get an output y, for which we need to manually fit the mask? For example, if we pad sequences to length 100, the y is always 100, but the real sequences are of variable length. How do we get model.predict() to return these variable lengths? – Obedient 25/11, 2018 at 20:28

So what I'm trying to say, when I pass this output to the Dense layer, which doesn't support masking, it will still calculate some loss for the 1st and 3rd indices in your example. And compare it with y 0s. How does one implement such Dense layers? – Obedient 25/11, 2018 at 20:58

@Obedient The loss will be computed according to the mask as well. I have updated my answer. Please take a look. – Propane 25/11, 2018 at 21:19

Hi @today, Thank you for your great answers. Do you have any suggestion on ignoring padded/missing timesteps for decoder in AE with multiple features ? – Heptarchy 13/6, 2021 at 15:10

You said: "the second and forth timesteps are the same as the output of first and third timesteps, respectively" but what I see is that the 1st and the 2nd are the same and also the 3rd and the 4th. Ain't that true? – Nereen 10/2, 2023 at 13:29

@AminShn That's exactly what I am saying :) Note the "respectively" at the end of my sentence; i.e. 2nd output == 1st output, and 4th output == 3rd output (I am NOT saying 2nd == 4th or 1st == 3rd).... I am not sure but maybe I could have written it more clearly. – Propane 11/2, 2023 at 22:39

By the way, I ran your code and the computed losses are different!! – Nereen 14/2, 2023 at 11:48

The process of informing the model that some part of the data is actually padding and should be ignored is called masking.

There are three ways to introduce input masks in Keras models:

Add a keras.layers.Masking layer.
Configure a keras.layers.Embedding layer with mask_zero=True.
Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
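For the first option, the rule that keras.layers.Masking applies is slightly different from the Embedding one: a timestep is masked only when all of its features equal mask_value. A NumPy sketch of that rule, with a made-up function name and made-up data, could look like this:

```python
import numpy as np

def masking_layer_mask(inputs, mask_value=0.0):
    """A timestep is kept if at least one feature differs from mask_value."""
    return np.any(np.not_equal(inputs, mask_value), axis=-1)

# One sequence, four timesteps, two features; the last two steps are padding.
features = np.array([[[0.5, 0.0],
                      [0.0, 1.2],
                      [0.0, 0.0],
                      [0.0, 0.0]]])

print(masking_layer_mask(features))  # [[ True  True False False]]
```

Note that the first two timesteps are kept even though each contains a zero, because only fully padded timesteps are masked.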

The code below introduces input masks using keras.layers.Embedding:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

raw_inputs = [[83, 91, 1, 645, 1253, 927],[73, 8, 3215, 55, 927],[711, 632, 71]]
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs,
                                                              padding='post')

print(padded_inputs)

embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_inputs)

print(masked_output._keras_mask)

The output of the above code is shown below:

[[  83   91    1  645 1253  927]
 [  73    8 3215   55  927    0]
 [ 711  632   71    0    0    0]]

tf.Tensor(
[[ True  True  True  True  True  True]
 [ True  True  True  True  True False]
 [ True  True  True False False False]], shape=(3, 6), dtype=bool)

For more information, refer to this TensorFlow tutorial.

Brunette answered 8/4, 2020 at 13:50 Comment(0)
