How to untokenize BERT tokens?
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.

from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

tokens = tz.tokenize(sentence)
print(tokens)

>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']

What I want is to get the text corresponding to the 4 tokens to the left and to the right of the token Madrid. So I want the tokens ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then to transform them back into the original text. In this case that would be 'Natural Science Museum of Madrid shows the REC'.

Is there a way to do this?

Clermontferrand answered 16/2, 2021 at 22:14 Comment(0)

In addition to the information provided by Jindrich about the information loss, I want to add that Hugging Face provides a built-in method to convert tokens to a string (the lost information remains lost!). The method is called convert_tokens_to_string:

tz.convert_tokens_to_string(tokens[1:10])

Output:

'Natural Science Museum of Madrid shows the REC'
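If you want to compute the window instead of hard-coding the slice, here is a minimal sketch (not part of the original answer, and assuming the target word is a single token that occurs only once):

idx = tokens.index("Madrid")                # first occurrence of the target token
window = tokens[max(idx - 4, 0) : idx + 5]  # 4 tokens left, the token itself, 4 right
tz.convert_tokens_to_string(window)
# 'Natural Science Museum of Madrid shows the REC'

Note that tokens.index only finds the first occurrence, and a target word that is itself split into several word pieces needs extra handling.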
Dram answered 20/2, 2021 at 13:19 Comment(2)
But can I be sure that when I tokenize the untokenized sentence (i.e., 'Natural Science Museum of Madrid shows the REC'), the resulting tokens are the same as the original ones? – Clermontferrand
What do you mean by the original ones? You cannot be sure that you can reconstruct the string, as Jindrich has explained in his answer. Another example is unknown tokens, [UNK]. For example, the following leads to an [UNK]: tz.tokenize("The Natural Science Museum of Madrid 🐳 "), which means you lose information during tokenization. @Clermontferrand – Dram

BERT uses WordPiece tokenization, which is unfortunately not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenizer is fully reversible.
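For illustration (a sketch, not part of the original answer; "roberta-base" is the standard checkpoint), RoBERTa's tokenization round-trips exactly:

from transformers import RobertaTokenizer

rtz = RobertaTokenizer.from_pretrained("roberta-base")
ids = rtz.encode("This is a sentence (with brackets).", add_special_tokens=False)
rtz.decode(ids)
# 'This is a sentence (with brackets).'  <- spacing and punctuation survive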

You can get the so-called pre-tokenized text by merging the tokens that start with ##:

pretok_sent = ""
for tok in tokens:
     if tok.startswith("##"):
         pretok_sent += tok[2:]
     else:
         pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]
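Applied to the tokens from the question, this reconstructs the original sentence, since it contains no punctuation:

print(pretok_sent)
# The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur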

Note, however, that if the sentence contains punctuation, the punctuation will remain separated from the other tokens; that is what pre-tokenization means. Such a sentence can look like this:

'This is a sentence ( with brackets ) .'

Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know if, and how many, extra spaces were in the original sentence). You can get a standard sentence by applying detokenization rules, such as those in sacremoses.

import sacremoses

detok = sacremoses.MosesDetokenizer(lang="en")
detok.detokenize(pretok_sent.split(" "))

This results in:

'This is a sentence (with brackets).'
Brina answered 17/2, 2021 at 9:4 Comment(1)
This isn't exactly true, because if you use a Pipeline it returns the original start and end index into the string along with the prediction. This means you can recreate the text, although you could not disambiguate the whitespace character type. – Fluctuate
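The offset-based approach mentioned in the comment can be sketched with the fast tokenizer's return_offsets_mapping (an illustration, not part of the answer; the indices match the example sentence from the question):

from transformers import BertTokenizerFast

tz_fast = BertTokenizerFast.from_pretrained("bert-base-cased")
enc = tz_fast(sentence, return_offsets_mapping=True, add_special_tokens=False)
start = enc["offset_mapping"][1][0]  # first character of 'Natural'
end = enc["offset_mapping"][9][1]    # last character of '##EC'
sentence[start:end]
# 'Natural Science Museum of Madrid shows the REC'

Because the offsets index directly into the original string, slicing it back preserves the exact spacing and casing, sidestepping detokenization entirely.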
