I am confused about how to format my own pre-trained weights for Keras Embedding
layer if I'm also setting mask_zero=True
. Here's a concrete toy example.
Suppose I have a vocabulary of 4 words [1,2,3,4]
and am using vector weights defined by:
weight[1]=[0.1,0.2]
weight[2]=[0.3,0.4]
weight[3]=[0.5,0.6]
weight[4]=[0.7,0.8]
I want to embed sentences of length up to 5 words, so I have to zero pad them before feeding them into the Embedding layer. I want to mask out the zeros so further layers don't use them.
Reading the Keras docs for Embedding, it says the 0 value can't be in my vocabulary.
mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
So what I'm confused about is how to construct the weight array for the Embedding layer, since "index 0 cannot be used in the vocabulary." If I build the weight array as
[[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
then normally, word 1
would point to index 1, which in this case holds the weights for word 2
. Or is it that when you specify mask_zero=True
, Keras internally makes it so that word 1
points to index 0? Alternatively, do you just prepend a vector of zeros in index zero, as follows?
[[0.0,0.0],
[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
This second option seems to me to put the zero into the vocabulary. In other words, I'm very confused. Can anyone shed light on this?