How is the position-wise feed-forward neural network implemented in transformers?

I am having a hard time understanding the position-wise feed-forward neural network in the transformer architecture.

[Figure: transformer encoder block, with a feed-forward network drawn separately for each position on top of the self-attention sub-layer]

Let's take machine translation as an example, where the inputs are sentences. From the figure, I understand that for each word a different feed-forward neural network is applied to the output of the self-attention sub-layer. The feed-forward networks apply similar linear transformations, but the actual weights and biases of each transformation are different, because they are different feed-forward neural networks.

Referring to Link, here is the class for the position-wise feed-forward neural network:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)  # expand: d_model -> d_ff
        self.w_2 = nn.Linear(d_ff, d_model)  # project back: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); the linear layers act on the last dimension
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

My question is:

I don't see anything position-wise about this. This is a simple fully connected neural network with two layers. Assuming x to be the list of embeddings of each word in a sentence, each word in the sentence is transformed by the above layer using the same set of weights and biases (correct me if I am wrong).

I was expecting to find something like passing each word embedding to a separate Linear layer, each with its own weights and biases, to achieve something similar to what is shown in the picture.
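
Something like this hypothetical sketch is what I had in mind (the class name, max_seq_len, and everything below are made up for illustration, not taken from the linked code), with a separate Linear layer, i.e. separate weights and biases, for every position:

import torch
import torch.nn as nn

class PerPositionFeedForward(nn.Module):
    "Hypothetical: a separate feed-forward network for every position."
    def __init__(self, max_seq_len, d_model, d_ff):
        super().__init__()
        # one independent two-layer network (own weights and biases) per position
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(max_seq_len)
        )

    def forward(self, x):  # x: (batch_size, seq_len, d_model)
        return torch.stack([self.ffns[i](x[:, i]) for i in range(x.size(1))], dim=1)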

Votive answered 2/1, 2023 at 5:59

I had the same doubt, but I indeed agree with the answer from @Rabin Adhikari.

In the provided implementation, the x that is passed to the forward method is a tensor of shape (batch_size, sequence_length, embedding_dimension), rather than a flattened version of it (with shape (batch_size, sequence_length * embedding_dimension)). The (same) feed-forward layer is applied to the last dimension only (the embedding dimension), independently for each batch element and for each position in the sequence, hence position-wise.

This explains the quote from the paper, which also appears in the answer below and in your question:

While the linear transformations are the same across different positions, they use different parameters from (encoder) layer to (encoder) layer.

It is easy to see that, by feeding the position-wise feed-forward layer a sequence made of repetitions of the same token (represented here by the trivial embedding obtained via torch.ones()), you get the same output embedding at each position.

feed_forward = PositionwiseFeedForward(d_model=5, d_ff=3, dropout=0)
input_embeddings = torch.ones(1, 10, 5)
ff_outputs = feed_forward(input_embeddings)

ff_outputs, ff_outputs.shape
# --> (tensor([[[-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252],
#               [-0.5512, -0.3976,  0.4570,  0.5153,  0.4252]]],
#              grad_fn=<ViewBackward0>), torch.Size([1, 10, 5]))
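
As a further check (a small sketch reusing the feed_forward module defined above and the class from the question, with a made-up random input), applying the layer to the whole (batch_size, seq_len, d_model) tensor gives the same result as looping over the positions and feeding each d_model-sized vector separately:

x = torch.randn(2, 7, 5)                      # (batch_size, seq_len, d_model)
batched = feed_forward(x)                     # one call on the full tensor
per_position = torch.stack(
    [feed_forward(x[:, i]) for i in range(x.size(1))], dim=1
)                                             # same module applied position by position
print(torch.allclose(batched, per_position))  # --> True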

I would also like to quote from the Natural Language Processing with Transformers book:

Note that a feed-forward layer such as nn.Linear is usually applied to a tensor of shape (batch_size, input_dim), where it acts on each element of the batch independently. This is actually true for any dimension except the last one, so when we pass a tensor of shape (batch_size, seq_len, hidden_dim) the layer is applied to all token embeddings of the batch and sequence independently.

Finally, it is important to observe that in this way you obtain a hidden state for each token in the batch, which makes the architecture very flexible.

Keeley answered 4/3, 2023 at 0:11

The weights are shared within a layer. After self-attention, all the transformed vectors are assumed to lie in the same vector space, so the same transformation can be applied to each of them. This intuition is also used in other tasks like sequence labelling, where every token shares the same classification head. This reduces the number of parameters in the network and forces the self-attention to do the heavy lifting.

Quoting from the link you provided,

While the linear transformations are the same across different positions, they use different parameters from layer to layer.

The x passed in the forward function is a single word vector z_i, not a list.
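
To make the layer-to-layer part concrete, here is a minimal sketch (it reuses the PositionwiseFeedForward class from the question; the hyperparameter values are arbitrary): each encoder layer instantiates its own feed-forward module, so the parameters differ between layers but are shared across all positions within a layer.

import torch
import torch.nn as nn

num_layers, d_model, d_ff = 6, 512, 2048
# one independent feed-forward module per encoder layer
layer_ffns = nn.ModuleList(
    PositionwiseFeedForward(d_model, d_ff) for _ in range(num_layers)
)

# different parameters from layer to layer ...
print(torch.equal(layer_ffns[0].w_1.weight, layer_ffns[1].w_1.weight))  # --> False
# ... but within a layer the same w_1 and w_2 act on every position of the input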

Yogi answered 2/1, 2023 at 15:30

It is indeed just a single feedforward network rather than a separate one for each position. I don’t know why the paper says “position-wise”. As you said, there’s nothing really position-wise here.

Tunnel answered 2/1, 2023 at 6:34

I will answer intuitively why it is position-wise. According to the code definition in the question, if you have a batched input of dimensions (batch, sequence_length, d_model), note that the input dimension of the FF layer is d_model. This means that the same network is applied, at the same time, to every token along the feature dimension, which is d_model. The example in @amiola's answer above explains the concept numerically.
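
As a quick numerical check of this (a small sketch, assuming the PositionwiseFeedForward class from the question), the number of parameters depends only on d_model and d_ff, not on the sequence length, so there cannot be a separate network hiding behind each position:

ff = PositionwiseFeedForward(d_model=512, d_ff=2048)
n_params = sum(p.numel() for p in ff.parameters())
# w_1: 512*2048 weights + 2048 biases, w_2: 2048*512 weights + 512 biases
print(n_params)  # --> 2099712, no matter how many positions the input has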

Alfaro answered 6/1, 2024 at 17:7

Position-wise is a logical concept: each vector in the output matrix depends only on the vector of a single token in the input, but the actual implementation still uses an ordinary matmul. This is no different from the computation in attention; it's just that when calculating Q*K^T in attention, you can see that each vector depends on the vectors of all tokens.
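
One way to see the difference (a toy sketch with made-up tensors, not from the original post): perturb the vector of a single token. The position-wise feed-forward output changes only at that position, while the attention score matrix Q*K^T changes in every row.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(10, 8)               # 10 tokens, d_model = 8
x_perturbed = x.clone()
x_perturbed[3] += 1.0                # change only token 3

W1, W2 = torch.randn(8, 16), torch.randn(16, 8)
ffn = lambda t: F.relu(t @ W1) @ W2  # a bias-free position-wise feed-forward

# feed-forward: only position 3 is affected
print((ffn(x) != ffn(x_perturbed)).any(dim=-1))              # --> True only at index 3

# attention scores Q*K^T (here simply Q = K = x): every row is affected
print((x @ x.T != x_perturbed @ x_perturbed.T).any(dim=-1))  # --> True at every index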

Arminius answered 27/6, 2024 at 2:25
