How are the TokenEmbeddings in BERT created?

In the paper describing BERT, there is this paragraph about WordPiece Embeddings.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

[Figure 2 from the paper]

As I understand it, WordPiece splits words into word pieces, e.g. "swimming" becomes "swim" and "##ing", but it does not generate embeddings itself. However, I could not find anything in the paper or in other sources about how those token embeddings are generated. Are they pretrained before the actual pre-training? How? Or are they randomly initialized?

Handspike answered 16/9, 2019 at 16:29

The WordPiece vocabulary is trained separately, such that the most frequent words remain intact as single tokens and the less frequent words eventually get split down to characters.

The embeddings are trained jointly with the rest of BERT. Back-propagation runs through all the layers down to the embeddings, which get updated just like any other parameters in the network.

Note that only the embeddings of tokens that are actually present in the training batch get updated; the rest remain unchanged. This is also a reason why you need a relatively small word-piece vocabulary, so that all embeddings get updated frequently enough during training.
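A minimal PyTorch sketch of that point, with a toy vocabulary and dimensions rather than BERT's real configuration: the embedding table is an ordinary trainable parameter, and after a backward pass only the rows for token ids that appeared in the batch carry a non-zero gradient.

import torch
import torch.nn as nn

vocab_size, d_model = 100, 16
embedding = nn.Embedding(vocab_size, d_model)   # randomly initialized lookup table

batch = torch.tensor([[3, 7, 42]])              # toy batch containing only ids 3, 7 and 42
hidden = embedding(batch)                       # shape (1, 3, 16)
loss = hidden.sum()                             # stand-in for the real masked-LM loss
loss.backward()

# Only rows 3, 7 and 42 of the table have non-zero gradients; the optimizer
# would leave every other row unchanged on this step.
touched = (embedding.weight.grad.abs().sum(dim=1) != 0).nonzero().flatten()
print(touched)                                  # tensor([ 3,  7, 42])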

Brote answered 17/9, 2019 at 9:14

Comment from Pitch: In BERT, there is an aggregated input embedding (the sum of the token, positional and segment embeddings), let's call it X, and then a weight matrix W. Some computation between X and W produces an output embedding. During training, do they update the weight matrix and also update X? This is a bit difficult for me to understand. Could you please give me some more info on this?
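A hedged sketch of what that comment describes, with illustrative sizes rather than BERT's actual configuration: X is the sum of three embedding lookups, and the "W" standing in for the encoder's weights is trained together with the embedding tables.

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768

token_emb    = nn.Embedding(vocab_size, d_model)
position_emb = nn.Embedding(max_len, d_model)
segment_emb  = nn.Embedding(2, d_model)                       # sentence A vs. sentence B

ids       = torch.tensor([[101, 6160, 2000, 102]])            # (batch, seq_len)
segments  = torch.zeros_like(ids)                             # everything is "sentence A" here
positions = torch.arange(ids.size(1)).unsqueeze(0)            # 0, 1, 2, 3

# X: the summed input embedding from Figure 2 of the paper.
X = token_emb(ids) + position_emb(positions) + segment_emb(segments)

W = nn.Linear(d_model, d_model)    # stand-in for one of the encoder's weight matrices
out = W(X)

# Calling backward() on a loss computed from `out` produces gradients for W
# *and* for the rows of all three embedding tables, so both get updated together.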

Firstly, the token IDs are simply the tokens' indices in the vocabulary. (Or a specialized tokenizer can apply a more complicated mapping, such as including an offset for special tokens.)

Secondly, an embedding layer with trainable weights maps the IDs to d_model-dimensional vectors; the shape goes from (batch, seq_len) to (batch, seq_len, d_model).
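A tiny shape check of that mapping with torch.nn.Embedding; the vocabulary size here is the one used by bert-base-uncased, and the input ids are the ones from the example below.

import torch
import torch.nn as nn

d_model = 768
emb = nn.Embedding(num_embeddings=30522, embedding_dim=d_model)   # roughly the 30k vocabulary from the paper

ids = torch.tensor([[101, 6160, 2000, 17662, 12172, 21415, 999, 102]])   # (batch=1, seq_len=8)
vectors = emb(ids)
print(vectors.shape)    # torch.Size([1, 8, 768]), i.e. (batch, seq_len, d_model)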

Another answer gave an example, but it does not clearly state that each number is the token's index in the vocabulary:

BERT’s input is essentially subwords. 
For example, if I want to feed BERT the sentence 
“Welcome to HuggingFace Forums!”, what actually gets fed in is:
['[CLS]', 'welcome', 'to', 'hugging', '##face', 'forums', '!', '[SEP]'].

Each of these tokens is mapped to an integer:
[101, 6160, 2000, 17662, 12172, 21415, 999, 102].

Then I searched for and downloaded the vocabulary file (vocab.txt for bert-base-uncased) and verified the numbers above.
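The same check can be reproduced with the Hugging Face transformers library instead of reading vocab.txt by hand (a sketch assuming transformers is installed and can download bert-base-uncased):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Welcome to HuggingFace Forums!"))
# ['welcome', 'to', 'hugging', '##face', 'forums', '!']

print(tokenizer("Welcome to HuggingFace Forums!")["input_ids"])
# [101, 6160, 2000, 17662, 12172, 21415, 999, 102]   ([CLS]=101 and [SEP]=102 added automatically)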

Other links:
torch.nn.Embedding
How does nn.Embedding work? And, is an embedding layer essentially just a linear layer?
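On the last linked question, a short sketch of the usual argument (arbitrary sizes): an embedding lookup gives the same result as multiplying a one-hot vector by the same weight matrix, i.e. a bias-free linear layer implemented as an indexed lookup for efficiency.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10, 4
emb = nn.Embedding(vocab_size, d_model)

ids = torch.tensor([2, 5])
one_hot = F.one_hot(ids, num_classes=vocab_size).float()   # (2, 10)

lookup = emb(ids)                      # indexed lookup
matmul = one_hot @ emb.weight          # one-hot vectors times the same weight matrix

print(torch.allclose(lookup, matmul))  # True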

Disturb answered 12/7, 2022 at 6:28
