How to correctly use mask_zero=True for Keras Embedding with pre-trained weights?

I am confused about how to format my own pre-trained weights for Keras Embedding layer if I'm also setting mask_zero=True. Here's a concrete toy example.

Suppose I have a vocabulary of 4 words [1,2,3,4] and am using vector weights defined by:

weight[1]=[0.1,0.2]
weight[2]=[0.3,0.4]
weight[3]=[0.5,0.6]
weight[4]=[0.7,0.8]

I want to embed sentences of length up to 5 words, so I have to zero pad them before feeding them into the Embedding layer. I want to mask out the zeros so further layers don't use them.
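For example, here is a minimal sketch of the padding step I have in mind (the sentences and the use of pad_sequences are just for illustration):

from keras.preprocessing.sequence import pad_sequences

# Two toy sentences over the vocabulary [1, 2, 3, 4].
sentences = [[1, 2, 3], [4, 2, 1, 3, 2]]

# Zero-pad to length 5 so both fit the Embedding layer's input_length.
padded = pad_sequences(sentences, maxlen=5, padding='post')
print(padded)
# [[1 2 3 0 0]
#  [4 2 1 3 2]]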

Reading the Keras docs for Embedding, I see that the value 0 can't be in my vocabulary:

mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

So what I'm confused about is how to construct the weight array for the Embedding layer, since "index 0 cannot be used in the vocabulary." If I build the weight array as

[[0.1,0.2],
 [0.3,0.4],
 [0.5,0.6],
 [0.7,0.8]]

then normally, word 1 would point to index 1, which in this case holds the weights for word 2. Or is it that when you specify mask_zero=True, Keras internally makes it so that word 1 points to index 0? Alternatively, do you just prepend a vector of zeros in index zero, as follows?

[[0.0,0.0],
 [0.1,0.2],
 [0.3,0.4],
 [0.5,0.6],
 [0.7,0.8]]

This second option seems to me to put the zero index into the vocabulary, which is exactly what the docs say to avoid. In other words, I'm confused. Can anyone shed light on this?

Loudmouthed answered 17/7, 2018 at 13:29

Your second approach is correct. You will want to construct your embedding layer in the following way:

import numpy as np
from keras.layers import Embedding

embedding = Embedding(
    output_dim=embedding_size,
    input_dim=vocabulary_size + 1,  # vocabulary size + 1 to reserve index 0 for padding
    input_length=input_length,
    mask_zero=True,
    # Prepend a zero row so that index 0 maps to the padding vector.
    weights=[np.vstack((np.zeros((1, embedding_size)),
                        embedding_matrix))],
    name='embedding'
)(input_layer)

where embedding_matrix is the second matrix you provided.
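As a sanity check, here is a minimal numpy sketch (using the toy numbers from the question) showing how the stacked matrix lines up: row 0 is the all-zero padding vector and each word keeps its own index.

import numpy as np

embedding_size = 2
embedding_matrix = np.array([[0.1, 0.2],   # word 1
                             [0.3, 0.4],   # word 2
                             [0.5, 0.6],   # word 3
                             [0.7, 0.8]])  # word 4

# Prepend a zero row so index 0 is reserved for padding.
weights = np.vstack((np.zeros((1, embedding_size)), embedding_matrix))

print(weights.shape)  # (5, 2) -> vocabulary_size + 1 rows
print(weights[1])     # [0.1 0.2] -> word 1 keeps index 1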

You can see this by looking at the implementation of Keras' Embedding layer. Notably, mask_zero is used only to mask the inputs:

def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
        return None
    output_mask = K.not_equal(inputs, 0)
    return output_mask

thus the lookup still uses the raw input indices: word i simply fetches row i of the weight matrix, so compared to a 0-indexed matrix every word is effectively shifted up by one row, and row 0 must hold the padding vector.
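To make that concrete, here is a plain-numpy sketch (not the Keras implementation itself) of the two pieces: the output is a gather on the raw indices, and the mask simply flags the non-zero positions.

import numpy as np

weights = np.vstack((np.zeros((1, 2)),         # row 0: padding
                     np.array([[0.1, 0.2],     # row 1: word 1
                               [0.3, 0.4],     # row 2: word 2
                               [0.5, 0.6],     # row 3: word 3
                               [0.7, 0.8]])))  # row 4: word 4

inputs = np.array([1, 2, 0, 0, 0])  # a zero-padded 5-token sentence

output = weights[inputs]            # what the embedding lookup returns
mask = inputs != 0                  # what compute_mask returns

print(output)
# [[0.1 0.2]
#  [0.3 0.4]
#  [0.  0. ]
#  [0.  0. ]
#  [0.  0. ]]
print(mask)  # [ True  True False False False]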

Dardanus answered 17/7, 2018 at 13:57
