Why is the embedding vector multiplied by a constant in the Transformer model?

Asked 8/7, 2019 at 8:12 Answered 11/6, 2024 at 9:26

python tensorflow deep-learning attention-model

I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.

As the section Positional encoding says:

Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.

The positional encoding vector is added to the embedding vector.

My understanding is that the positional encoding vector is added directly to the embedding vector. But when I looked at the code, I found that the embedding vector is first multiplied by a constant.

The code in the Encoder section is as follows:

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)


    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

We can see that x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32)) is applied before x += self.pos_encoding[:, :seq_len, :].

So why is the embedding vector multiplied by a constant before the positional encoding is added in the Transformer model?

Wind answered 8/7, 2019 at 8:12 Comment(7)

Seems weird indeed; it would make sense if it were /=, referring to the normalizing factor in equation (1) in the paper – Transitory 8/7, 2019 at 8:25

@Transitory The normalizing factor is implemented elsewhere in the tutorial, but not here. It is part of self-attention and should be applied after the embedding vector and positional encoding are added, so I can't understand why the embedding vector is multiplied by a constant. – Wind 8/7, 2019 at 8:45

I get ya, revised it again and it indeed seems like a mistake – Transitory 8/7, 2019 at 9:8

@Transitory I found that TensorFlow's official code also uses this method of calculation. The comment in the official code is #Scale embedding by the sqrt of the hidden size. – Wind 11/7, 2019 at 2:56

Did you find any rationale for why it is done? – Transitory 11/7, 2019 at 5:55

@Transitory No, I haven't found any theoretical explanation. – Wind 11/7, 2019 at 6:18

@Wind I found a possible explanation here: datascience.stackexchange.com/a/88159/113304. Please refer to it. – Steady 14/3, 2021 at 11:34

Looking around, I found this argument:

The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
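
A rough numeric sketch of that argument, in plain Python. It assumes an untrained embedding drawn from uniform(-0.05, 0.05), which is the default Keras Embedding initializer; trained embeddings, or other initializers, can have very different magnitudes, so treat the numbers as illustrative only. The helper names are mine, not from the tutorial.

```python
import math
import random

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position, following the paper's formula."""
    pe = []
    for i in range(d_model):
        # Dimensions 2i and 2i+1 share the same angle.
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

d_model = 512
rng = random.Random(0)

# Assumption: untrained embedding row, uniform(-0.05, 0.05).
embedding = [rng.uniform(-0.05, 0.05) for _ in range(d_model)]
scaled = [v * math.sqrt(d_model) for v in embedding]
pe = positional_encoding(10, d_model)

raw_norm = l2_norm(embedding)
scaled_norm = l2_norm(scaled)
# Each sin/cos pair contributes sin^2 + cos^2 = 1, so this is sqrt(d_model / 2).
pe_norm = l2_norm(pe)

print(f"embedding norm (unscaled): {raw_norm:.3f}")
print(f"embedding norm (scaled):   {scaled_norm:.3f}")
print(f"positional encoding norm:  {pe_norm:.3f}")
```

Under this assumption the unscaled embedding is much smaller than the positional encoding, while the scaled one has a comparable magnitude.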

Cripple answered 11/5, 2020 at 21:2 Comment(1)

Interesting... I chanced upon this question while googling how to ensure that the positional embeddings (a relatively weak signal in my model) do not significantly mess up the actual input data embeddings. I was thinking of reducing the dimensions of the positions, etc., but a scalar multiplication to magnify the input data embeddings seems to be a better option. It seems logical. But is this the reason? cfb below seems to think otherwise – Vandervelde 25/12, 2020 at 10:12

I believe the reason for this scaling has nothing to do with the scale applied at the attention layers. It is likely because the Transformer shares the weights of the embedding layer and the output softmax. The scale you would use for the embeddings is different from the scale you would use for a fully connected layer.

Some implementations of the transformer use this scaling even though they don't actually share the embedding weights at the output layer, but that is probably kept there for consistency (or by mistake). Just make sure that the initialization of your embeddings is consistent.
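
To illustrate what "different scales" could mean here, a minimal sketch in plain Python. It assumes the shared matrix is initialized with standard deviation d_model ** -0.5, a scale suited to a dense projection; that initialization, the sizes, and the function names are my assumptions, not something stated in the paper or the tutorial.

```python
import math
import random

d_model = 256
vocab_size = 200
rng = random.Random(0)

# Assumption: one shared matrix, initialized at a scale suited to a dense layer.
std = d_model ** -0.5
shared = [[rng.gauss(0.0, std) for _ in range(d_model)]
          for _ in range(vocab_size)]

def embed(token_id):
    """Input side: look up the row and multiply by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [w * scale for w in shared[token_id]]

def output_logits(hidden):
    """Output side: reuse the same rows, unscaled, as the pre-softmax projection."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# A stand-in for a decoder output with roughly unit-variance entries.
hidden = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

unscaled_var = variance(shared[3])       # about 1 / d_model
embed_var = variance(embed(3))           # about 1 after scaling
logit_var = variance(output_logits(hidden))  # about 1 without any scaling

print(f"unscaled row variance:     {unscaled_var:.4f}")
print(f"scaled embedding variance: {embed_var:.4f}")
print(f"logit variance:            {logit_var:.4f}")
```

With this initialization the same weights give reasonably sized logits as a projection, but only give roughly unit-variance embeddings once multiplied by sqrt(d_model).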

Vein answered 7/1, 2020 at 1:16 Comment(0)

There is a formula in "Attention Is All You Need" that is easily missed because it is presented only as text (I missed it myself, even when searching for an explanation of the multiplication).

In the embedding layers, we multiply those weights by √d_model.

See section 3.4, Embeddings and Softmax.
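
Written out as an equation, combining that sentence with the positional encoding from section 3.5 of the paper, the encoder input would look as follows. The symbols E (the learned embedding matrix), w_t (the token at position t) and x_t are my own notation, not the paper's.

```latex
x_t = \sqrt{d_{\text{model}}}\; E[w_t] + PE(t), \qquad
PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

This corresponds to the three lines in the question's call method: the embedding lookup, the multiplication by sqrt(d_model), and the addition of pos_encoding.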

Demarco answered 11/6, 2024 at 9:26 Comment(0)
