BPE multiple ways to encode a word

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "ke", "en"). Then the word "token" could be encoded as ("to", "ke", "n") or ("to", "k", "en"). Such ambiguous encodings are also mentioned in this tutorial https://blog.floydhub.com/tokenization-nlp/

However, in the Hugging Face tutorial it is mentioned that "BPE and WordPiece [...] work out rules in a certain order that you can then apply in the same order when tokenizing new text", see https://huggingface.co/transformers/master/tokenizer_summary.html.

How exactly are these rules stored and applied when using BPE/WordPiece? E.g., in my example above, how is it determined which tokenization to use?
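To make the ambiguity concrete, here is a small Python sketch (purely illustrative, using the hypothetical letters-plus-merges vocabulary from my example) that enumerates every way "token" can be split into pieces of that vocabulary:

    import string

    # Hypothetical vocabulary: all single letters plus the merged symbols "to", "ke", "en".
    vocab = set(string.ascii_lowercase) | {"to", "ke", "en"}

    def segmentations(word):
        """Yield every split of `word` into pieces that are all in `vocab`."""
        if not word:
            yield []
            return
        for i in range(1, len(word) + 1):
            piece = word[:i]
            if piece in vocab:
                for rest in segmentations(word[i:]):
                    yield [piece] + rest

    for seg in segmentations("token"):
        print(seg)
    # Among others: ['to', 'ke', 'n'] and ['to', 'k', 'en'] -- the ambiguity in question.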

Steiger asked on 5/8/2020 at 11:07. Comments:
It just means you may use a BPE or a WordPiece (or SentencePiece) model to encode some text and then decode it to obtain the original text. If you are training from scratch, choose any; when you train incrementally, you will need to apply the same tokenization scheme. – Jebel
OK, thanks, but let's say I have used BPE/WordPiece for pre-processing and then trained a language model like GPT or BERT. Now I apply the trained model to a new text that contains an ambiguous word ("token" in my example). It obviously makes a difference how this word is processed, since it affects the model's prediction. So how is the encoding of the word determined? – Steiger
My guess is that BPE/WordPiece always use the largest units possible. However, sometimes all possible subword tokenizations might have the same length (e.g., as in my example). – Steiger
I do not think you can have the same word tokenized in a different way. Even if it is, it should not be a problem. – Jebel

The BPE algorithm learns its merge rules in a particular order, based on the frequencies of subtoken pairs in the training corpus. This ordering is then applied greedily when encoding new text.

Considering the example from above, let's say the pair (e, n) appears 10 times in your training corpus, (t, o) appears 6 times and (k, e) appears 4 times. The BPE algorithm will learn 3 rules and apply them in the following order:

1. e, n -> en
2. t, o -> to
3. k, e -> ke

Encoding the new text "token" doesn't proceed greedily from left to right; instead, the rules are applied one after another in their learned order. Therefore, the text is encoded as follows:

Rule 1: t o k e n -> t o k en
Rule 2: t o k en -> to k en
Rule 3: to k en -> to k en (no change: "k" and "e" are no longer adjacent)

The Byte-Pair Encoding tokenization chapter of the Hugging Face course provides a reference implementation.
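As a minimal sketch (not the Hugging Face reference implementation), the following applies each learned merge rule once, left to right, in the learned order, and reproduces the walkthrough above:

    # Learned merge rules in the order from the frequency example above.
    merges = [("e", "n"), ("t", "o"), ("k", "e")]

    def encode(word, merges):
        symbols = list(word)                      # start from individual characters
        for left, right in merges:                # apply rules in their learned order
            i, merged = 0, []
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                    merged.append(left + right)   # merge this occurrence of the pair
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return symbols

    print(encode("token", merges))  # ['to', 'k', 'en']

Real tokenizers typically merge the highest-priority pair present over and over rather than making one pass per rule, but for this example the result is the same.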

Isomerism answered on 14/9/2022 at 20:32. Comment:
For clarity, the full application sequence should read: Rule 1: t o k e n -> t o k en; Rule 2: t o k en -> to k en; Rule 3: to k en -> to k en (Rule 3 changes nothing, since "k e" is no longer adjacent). – Haberdashery

In the parsing step of BPE, the merging order matters. For instance, if the merging order is

(p, e), (pe, n), (pen, _), (a, p), (ap, p), (app, l), (appl, e), (apple, _), (pen, apple_)

then, given k = 2, Applepen PenapplePen should be segmented as [a, p, p, l, e, pen, pen, a, p, p, l, e, pen]: we use only the first two merges, (p, e) and (pe, n), for parsing. Since the merge order is fixed, the result on test data is deterministic for any k; you simply apply the first k merges in the parsing step.
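A rough sketch of this "first k merges" parsing, assuming lowercased input and ignoring the end-of-word marker "_" for simplicity (neither of the first two merges involves it):

    # Merge rules in their learned order, as listed above ("_" marks the end of a word).
    all_merges = [("p", "e"), ("pe", "n"), ("pen", "_"), ("a", "p"), ("ap", "p"),
                  ("app", "l"), ("appl", "e"), ("apple", "_"), ("pen", "apple_")]

    def parse(text, merges):
        tokens = []
        for word in text.lower().split():
            symbols = list(word)                  # "_" marker omitted in this sketch
            for left, right in merges:            # apply the given merges in order
                i, out = 0, []
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                        out.append(left + right)
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                symbols = out
            tokens.extend(symbols)
        return tokens

    k = 2
    print(parse("Applepen PenapplePen", all_merges[:k]))
    # ['a', 'p', 'p', 'l', 'e', 'pen', 'pen', 'a', 'p', 'p', 'l', 'e', 'pen']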

For the details please refer to my answer to the question: Explain bpe (Byte Pair Encoding) with examples?

Zircon answered on 2/8/2021 at 15:29.
