How is WordPiece tokenization helpful to effectively deal with the rare-words problem in NLP?

I have seen that NLP models such as BERT utilize WordPiece for tokenization. In WordPiece, we split tokens like playing into play and ##ing. It is mentioned that it covers a wider spectrum of Out-Of-Vocabulary (OOV) words. Can someone please explain how WordPiece tokenization is actually done, and how it effectively handles rare/OOV words?

Azriel answered 27/3, 2019 at 16:52 Comment(0)

WordPiece and BPE are two similar and commonly used techniques to segment words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the symbols in the vocabulary are iteratively added to the vocabulary.

Consider the WordPiece algorithm from the original paper (wording slightly modified by me):

  1. Initialize the word unit inventory with all the characters in the text.
  2. Build a language model on the training data using the inventory from 1.
  3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.
  4. Goto 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.

The BPE algorithm only differs in Step 3, where it simply chooses the new word unit as the combination of the next most frequently occurring pair among the current set of subword units.
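As a rough illustration, here is a minimal Python sketch of this training loop (the function name learn_bpe_merges and the overall structure are my own simplification, not taken from any particular library). It implements the BPE choice in Step 3; the comment marks where WordPiece would score candidates differently:

    from collections import Counter

    def learn_bpe_merges(corpus, num_merges):
        """Learn merge rules from a whitespace-tokenized corpus (simplified sketch)."""
        # Start from words split into individual characters (the initial inventory).
        words = [list(word) for word in corpus.split()]
        merges = []
        for _ in range(num_merges):
            # Count every adjacent pair of symbols in the corpus.
            pair_counts = Counter()
            for symbols in words:
                for a, b in zip(symbols, symbols[1:]):
                    pair_counts[(a, b)] += 1
            if not pair_counts:
                break
            # BPE: pick the most frequent pair.
            # WordPiece would instead pick the pair that most increases the
            # likelihood of the training data under its language model.
            best = max(pair_counts, key=pair_counts.get)
            merges.append(best)
            # Apply the chosen merge everywhere before the next iteration.
            new_words = []
            for symbols in words:
                merged, i = [], 0
                while i < len(symbols):
                    if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_words.append(merged)
            words = new_words
        return merges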

Example

Input text: she walked . he is a dog walker . i walk

First 3 BPE Merges:

  1. w a = wa
  2. l k = lk
  3. wa lk = walk

So at this stage, your vocabulary includes all the initial characters, along with wa, lk, and walk. You usually do this for a fixed number of merge operations.
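Running the learn_bpe_merges sketch above on this sentence reproduces the idea; "wa", "al", and "lk" all occur three times, so ties between equally frequent pairs may be broken differently and the exact merge order can vary:

    corpus = "she walked . he is a dog walker . i walk"
    print(learn_bpe_merges(corpus, num_merges=3))
    # this sketch breaks the three-way tie differently, giving
    # [('w', 'a'), ('wa', 'l'), ('wal', 'k')], but it still ends
    # with "walk" as a single unit, like the hand-worked merges above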

How does it handle rare/OOV words?

Quite simply, OOV words are impossible if you use such a segmentation method. Any word which does not occur in the vocabulary will be broken down into subword units. Similarly, for rare words, given that the number of subword merges we used is limited, the word will not occur in the vocabulary, so it will be split into more frequent subwords.

How does this help?

Imagine that the model sees the word walking. Unless this word occurs at least a few times in the training corpus, the model can't learn to handle it very well. However, the corpus may contain the words walked, walker, and walks, each occurring only a few times. Without subword segmentation, all these words are treated as completely different words by the model.

However, if these get segmented as walk@@ ing, walk@@ ed, etc., notice that all of them will now have walk@@ in common, which will occur much more frequently during training, so the model might be able to learn more about it.
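As a sketch of this segmentation step (the helper apply_merges and the merge list are mine, purely for illustration): plain BPE segments new text by replaying the learned merges in order, so a word the tokenizer never saw as a whole still decomposes into known pieces.

    def apply_merges(word, merges):
        """Segment a possibly unseen word by replaying learned merge rules in order."""
        symbols = list(word)
        for a, b in merges:
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return symbols

    # Hypothetical merges that build "walk" and "ing" but were never shown "walking":
    merges = [("w", "a"), ("wa", "l"), ("wal", "k"), ("i", "n"), ("in", "g")]
    print(apply_merges("walking", merges))   # ['walk', 'ing']
    print(apply_merges("walkable", merges))  # ['walk', 'a', 'b', 'l', 'e'] -- never OOV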

Indulge answered 29/3, 2019 at 11:56 Comment(5)
It's not exactly true that OOV words are impossible. Imagine you build the model where all training data is ASCII characters. At decode time, you pass in a Chinese character that was never seen before. It will be OOV. OOVs are going to be rather rare, but not impossible. – Hospitality
Right, thanks for the clarification. My assumption was that the segmentation is learned on the same text that is to be segmented, but that is of course not always true. – Indulge
Could you please elaborate on "Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model"? I see it quoted everywhere, but no one goes into detail about what this means exactly. – Whaleboat
The likelihood on the training data is based on frequency, meaning that you decide on the new word unit by counting how many times all the combinations appear in the training set. The most frequent combination is added and the process is repeated. @Whaleboat – Lesotho
@Lesotho I think that's only the case for the BPE algorithm, not for WordPiece. I think the author refers to the likelihood of the LM. – Stuartstub

WordPiece is very similar to BPE.

As an example, let’s assume that after pre-tokenization, the following set of words including their frequency has been determined:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Consequently, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:

("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the example above "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u" followed by "g", occurring 10 + 5 + 5 = 20 times in total. Thus, the first merge rule the tokenizer learns is to group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes

("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

BPE then identifies the next most common symbol pair. It’s "u" followed by "n", which occurs 16 times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed by "ug", occurring 15 times. Again the pair is merged and "hug" can be added to the vocabulary.

At this stage, the vocabulary is ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] and our set of unique words is represented as

("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)

Assuming that the Byte-Pair Encoding training stops at this point, the learned merge rules are then applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized to ["b", "ug"], but "mug" would be tokenized as ["<unk>", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not replaced by the "<unk>" symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.
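The walkthrough above can be reproduced with a short sketch (the helper names are mine, not from the tokenizers library; the counts and the three merges come out as described in the text):

    from collections import Counter

    word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
    splits = {w: list(w) for w in word_freqs}  # start from single characters

    def pair_frequencies(splits, word_freqs):
        """Count adjacent symbol pairs, weighted by how often each word occurs."""
        counts = Counter()
        for word, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                counts[(a, b)] += word_freqs[word]
        return counts

    def merge_pair(a, b, splits):
        """Replace every adjacent (a, b) with the merged symbol a + b."""
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
        return splits

    for _ in range(3):
        (a, b), freq = pair_frequencies(splits, word_freqs).most_common(1)[0]
        print(f"merge {a} + {b}  (frequency {freq})")
        splits = merge_pair(a, b, splits)
    # merge u + g (20), merge u + n (16), merge h + ug (15)
    print(splits)
    # {'hug': ['hug'], 'pug': ['p', 'ug'], 'pun': ['p', 'un'],
    #  'bun': ['b', 'un'], 'hugs': ['hug', 's']}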

The vocabulary size, i.e. the base vocabulary size plus the number of merges, is a hyperparameter to choose. For instance, GPT has a vocabulary size of 40,478, since it has 478 base characters and training was stopped after 40,000 merges.

WordPiece vs. BPE

WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary. Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the probabilities of its first and second symbols, is the greatest among all symbol pairs. E.g. "u" followed by "g" would only have been merged if the probability of "ug" divided by the probabilities of "u" and "g" had been greater than for any other symbol pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it loses by merging two symbols, to make sure it's worth it.
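A small sketch of that scoring rule on the same word counts, using the simplified score count(pair) / (count(first) * count(second)) that the paragraph describes (variable names are mine):

    from collections import Counter

    word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
    splits = {w: list(w) for w in word_freqs}

    pair_counts, symbol_counts = Counter(), Counter()
    for word, symbols in splits.items():
        for s in symbols:
            symbol_counts[s] += word_freqs[word]
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += word_freqs[word]

    # WordPiece-style score: how much more often a pair occurs together
    # than the raw frequencies of its two parts would suggest.
    scores = {pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
              for pair, count in pair_counts.items()}
    best = max(scores, key=scores.get)
    print(best, scores[best])
    # ('g', 's') wins with 5 / (20 * 5) = 0.05, even though "ug" is the most
    # frequent pair: the very common "u" and "g" are penalized, the rare "s" is not.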

Also, BPE tooling usually marks a subword that is continued by another with @@ at its end, while WordPiece places ## at the beginning of continuation pieces.
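For example, BERT's WordPiece tokenizer in the transformers library shows the ## convention directly (the exact split depends on the pretrained vocabulary, so the output below is only indicative):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.tokenize("walking walkability"))
    # continuation pieces come back with a "##" prefix, e.g. something like
    # ['walking', 'walk', '##ability'], depending on the vocabulary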

Succor answered 30/11, 2021 at 16:43 Comment(0)
