In the final layers of HuggingFace's DistilBertForSequenceClassification, the classification head takes the first hidden state along the sequence dimension of the transformer output (i.e. the hidden state at the [CLS] position) and uses it for classification.
hidden_state = distilbert_output[0] # (bs, seq_len, dim) <-- transformer output
pooled_output = hidden_state[:, 0] # (bs, dim) <-- first hidden state
pooled_output = self.pre_classifier(pooled_output) # (bs, dim)
pooled_output = nn.ReLU()(pooled_output) # (bs, dim)
pooled_output = self.dropout(pooled_output) # (bs, dim)
logits = self.classifier(pooled_output) # (bs, num_labels)
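For context, distilbert_output above is the output of the base DistilBertModel. A minimal sketch of how to produce it with the transformers library (the checkpoint name and example sentence are just placeholders):

import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("A placeholder sentence.", return_tensors="pt")
with torch.no_grad():
    distilbert_output = model(**inputs)

hidden_state = distilbert_output[0]  # (bs, seq_len, dim) -- last layer's hidden states
pooled_output = hidden_state[:, 0]   # (bs, dim) -- hidden state at the [CLS] position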
Is there any benefit to taking the first hidden state rather than the last hidden state, the average over the sequence, or even flattening the whole sequence with a Flatten layer?
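To make the alternatives concrete, this is roughly what I have in mind (a sketch over a dummy tensor with the same shape as hidden_state; the mask handling is my own assumption, not HuggingFace code):

import torch

bs, seq_len, dim = 4, 128, 768
hidden_state = torch.randn(bs, seq_len, dim)   # stand-in for the transformer output
attention_mask = torch.ones(bs, seq_len)       # stand-in for the tokenizer's mask

first_token = hidden_state[:, 0]    # (bs, dim) -- what HF uses (the [CLS] position)
last_token = hidden_state[:, -1]    # (bs, dim) -- may land on a [PAD] token

# Mean over the sequence, ignoring padded positions via the attention mask.
mask = attention_mask.unsqueeze(-1)                        # (bs, seq_len, 1)
mean_pooled = (hidden_state * mask).sum(1) / mask.sum(1)   # (bs, dim)

# Flatten the whole sequence into one vector; this ties the classifier's
# input size to seq_len, so it only works with a fixed sequence length.
flattened = hidden_state.flatten(start_dim=1)              # (bs, seq_len * dim)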