Byte-pair encoding (BPE) is one of three algorithms that deal with the unknown word problem (or with languages with rich morphology that require handling structure below the word level) in an automatic way; the other two are unigram language modeling and WordPiece. The BPE tokenization scheme has two parts: a token learner and a token segmenter.
The token learner takes a raw training corpus (sometimes roughly pre-separated into words, for example by whitespace) and induces a vocabulary, a set of tokens. The token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary.
The algorithm has a parameter k, the number of merges to perform, which is also the number of new symbols added to the final vocabulary.
Let's train BPE on this line (adapted from PPAP): Pen Penapple Apple Pen, and show how the unknown/rare "words" penapplepen and applepen in the test data can be automatically reduced to known subword units.
Learning
First, after some preprocessing (case mapping, regex-based pre-tokenization, and adding the end-of-word symbol _) we obtain the following corpus C of strings and their frequencies (listed as frequency: string):
2: p e n _
1: p e n a p p l e _
1: a p p l e _
The initial vocabulary V is [_, p, e, n, a, l].
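To make the example concrete, here is a minimal Python sketch of this preprocessing step, assuming lower-casing, a simple \w+ regex for pre-tokenization, and _ as the end-of-word symbol (the helper name preprocess is just for illustration):

```python
import re
from collections import Counter

def preprocess(line):
    """Lower-case, pre-tokenize with a simple regex, split each word into
    characters, and append the end-of-word symbol "_"."""
    words = re.findall(r"\w+", line.lower())
    return Counter(tuple(word) + ("_",) for word in words)

corpus = preprocess("Pen Penapple Apple Pen")
# Counter({('p', 'e', 'n', '_'): 2,
#          ('p', 'e', 'n', 'a', 'p', 'p', 'l', 'e', '_'): 1,
#          ('a', 'p', 'p', 'l', 'e', '_'): 1})
vocab = sorted({symbol for word in corpus for symbol in word})
# ['_', 'a', 'e', 'l', 'n', 'p']  (the same symbols as V above, just sorted)
```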
Now let's run the first round of the for-loop in the above pseudocode:
p, e <- most frequent pair in {(p, e): 3, (e, n): 3, (n, _): 2, (a, p): 2, (p, p): 2, (p, l): 2, (l, e): 2, (e, _): 2, (n, a): 1}
pe <- p + e
[_, p, e, n, a, l, pe] <- [_, p, e, n, a, l] + pe
C becomes this:
2: pe n _
1: pe n a p p l e _
1: a p p l e _
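These pair counts and the merge itself can be sketched with two small helpers (building on the corpus from the snippet above; the names are illustrative, and the tie between (p, e) and (e, n), both 3, is broken here by whichever pair was counted first):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count every pair of adjacent symbols, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated new symbol."""
    new_corpus = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] += freq
    return new_corpus

counts = get_pair_counts(corpus)          # {('p', 'e'): 3, ('e', 'n'): 3, ...}
best = max(counts, key=counts.get)        # ('p', 'e')
corpus = merge_pair(corpus, best)
# Counter({('pe', 'n', '_'): 2,
#          ('pe', 'n', 'a', 'p', 'p', 'l', 'e', '_'): 1,
#          ('a', 'p', 'p', 'l', 'e', '_'): 1})
```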
Let's run the second merge as follows:
pe, n <- most frequent pair in {(pe, n): 3, (n, _): 2, (a, p): 2, (p, p): 2, (p, l): 2, (l, e): 2, (e, _): 2, (n, a): 1}
pen <- pe + n
[_, p, e, n, a, l, pe, pen] <- [_, p, e, n, a, l, pe] + pen
C becomes this:
2: pen _
1: pen a p p l e _
1: a p p l e _
And here are the remaining merges if we take k >= 9:
| Merge | Current V |
| --- | --- |
| (pen, _) | [_, p, e, n, a, l, pe, pen, pen_] |
| (a, p) | [_, p, e, n, a, l, pe, pen, pen_, ap] |
| (ap, p) | [_, p, e, n, a, l, pe, pen, pen_, ap, app] |
| (app, l) | [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl] |
| (appl, e) | [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple] |
| (apple, _) | [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple, apple_] |
| (pen, apple_) | [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple, apple_, penapple_] |
We see that after 9 iterations of merging, there are no adjacent pairs in C.
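The whole learning loop from the pseudocode can then be sketched as follows (using the helpers above; with k >= 9 and this particular tie-breaking it happens to reproduce exactly the merge sequence in the table):

```python
def learn_bpe(raw_text, k):
    """Learn up to k merges, stopping early when no adjacent pairs remain."""
    corpus = preprocess(raw_text)
    vocab = sorted({symbol for word in corpus for symbol in word})
    merges = []
    for _ in range(k):
        counts = get_pair_counts(corpus)
        if not counts:                      # no adjacent pairs left in C
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        vocab.append(best[0] + best[1])     # add the new symbol to V
        corpus = merge_pair(corpus, best)
    return merges, vocab

merges, vocab = learn_bpe("Pen Penapple Apple Pen", k=9)
# merges == [('p', 'e'), ('pe', 'n'), ('pen', '_'), ('a', 'p'), ('ap', 'p'),
#            ('app', 'l'), ('appl', 'e'), ('apple', '_'), ('pen', 'apple_')]
```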
Parsing
The token parser (i.e., the token segmenter from above) just runs, on the test data, the merges we have learned from the training data, greedily, in the order we learned them. (Thus the frequencies in the test data don't play a role, only the frequencies in the training data.)
In this step we test the parser using this sentence: Applepen PenapplePen. As usual, we do the preprocessing we did in the training step and obtain:
a p p l e p e n _
p e n a p p l e p e n _
and follow the merging order:
(p, e), (pe, n), (pen, _), (a, p), (ap, p), (app, l), (appl, e), (apple, _), (pen, apple_)
First, (p, e). We merge p and e in the test data and get:
a p p l e pe n _
pe n a p p l e pe n _
Second, we apply (pe, n) and get:
a p p l e pen _
pen a p p l e pen _
.....
After all 9 rounds of merging we get the following (if k < 9 we would just apply the first k merges in this step; if k is 2, refer to this answer):
apple pen_
pen apple pen_
And the final tokenized test sentence is [apple, pen_, pen, apple, pen_]; the unknown (unseen in the training data) word penapplepen can also be separated into known subword units.
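A matching sketch of the token parser simply replays the learned merges, in order, on each preprocessed test word (using the merges list from the learning sketch above):

```python
def segment(word, merges):
    """Greedily apply the learned merges, in the order they were learned,
    to a single lower-cased test word."""
    symbols = tuple(word) + ("_",)  # same end-of-word symbol as in training
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = tuple(out)
    return list(symbols)

tokens = [t for w in "applepen penapplepen".split() for t in segment(w, merges)]
# ['apple', 'pen_', 'pen', 'apple', 'pen_']
```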
References:
- Neural Machine Translation of Rare Words with Subword Units
- Speech and Language Processing
- PPAP (Pen Pineapple Apple Pen)