How to untokenize BERT tokens?
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.

from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

tokens = tz.tokenize(sentence)
print(tokens)

>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']

What I want is to get the text corresponding to the 4 tokens to the left and to the right of the token Madrid. So I want the tokens ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then to transform them back into the original text. In this case that would be 'Natural Science Museum of Madrid shows the REC'.

Is there a way to do this?

Clermontferrand answered 16/2, 2021 at 22:14 Comment(0)

In addition to the information provided by Jindrich about the information loss, I want to add that Hugging Face provides a built-in method to convert tokens to a string (the lost information remains lost!). The method is called convert_tokens_to_string:

tz.convert_tokens_to_string(tokens[1:10])

Output:

'Natural Science Museum of Madrid shows the REC'
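If you want to compute the window instead of hard-coding the slice, here is a minimal sketch (not part of the original answer, and assuming the target word is a single token that occurs only once):

idx = tokens.index("Madrid")                # first occurrence of the target token
window = tokens[max(idx - 4, 0) : idx + 5]  # 4 tokens left, the token itself, 4 right
tz.convert_tokens_to_string(window)
# 'Natural Science Museum of Madrid shows the REC'

Note that tokens.index only finds the first occurrence, and a target word that is itself split into several word pieces needs extra handling.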
Dram answered 20/2, 2021 at 13:19 Comment(2)
But can I be sure that when I tokenize the untokenized sentence (i.e., 'Natural Science Museum of Madrid shows the REC'), the resulting tokens are the same as the original ones? – Clermontferrand
What do you mean by the original ones? You cannot be sure that you can reconstruct the string, as Jindrich has explained in his answer. Another example is unknown tokens, [UNK]. For example, the following leads to an [UNK]: tz.tokenize("The Natural Science Museum of Madrid 🐳 "), which means you lose information during tokenization. @Clermontferrand – Dram

BERT uses WordPiece tokenization, which is unfortunately not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenizer is fully reversible.
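For illustration (a sketch, not part of the original answer; "roberta-base" is the standard checkpoint), RoBERTa's tokenization round-trips exactly:

from transformers import RobertaTokenizer

rtz = RobertaTokenizer.from_pretrained("roberta-base")
ids = rtz.encode("This is a sentence (with brackets).", add_special_tokens=False)
rtz.decode(ids)
# 'This is a sentence (with brackets).'  <- spacing and punctuation survive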

You can get the so-called pre-tokenized text by merging the tokens that start with ##:

pretok_sent = ""
for tok in tokens:
     if tok.startswith("##"):
         pretok_sent += tok[2:]
     else:
         pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]
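Applied to the tokens from the question, this reconstructs the original sentence, since it contains no punctuation:

print(pretok_sent)
# The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur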

Note, however, that if the sentence contains punctuation, the punctuation will remain separated from the other tokens; that is what pre-tokenization means. Such a sentence can look like this:

'This is a sentence ( with brackets ) .'

Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know if, and how many, extra spaces were in the original sentence). You can get a standard sentence by applying detokenization rules, such as those in sacremoses.

import sacremoses

detok = sacremoses.MosesDetokenizer(lang="en")
detok.detokenize(pretok_sent.split(" "))

This results in:

'This is a sentence (with brackets).'
Brina answered 17/2, 2021 at 9:4 Comment(1)
This isn't exactly true, because if you use a Pipeline it returns the original start and end index into the string along with the prediction. This means you can recreate the text, although you could not disambiguate the whitespace character type. – Fluctuate
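The offset-based approach mentioned in the comment can be sketched with the fast tokenizer's return_offsets_mapping (an illustration, not part of the answer; the indices match the example sentence from the question):

from transformers import BertTokenizerFast

tz_fast = BertTokenizerFast.from_pretrained("bert-base-cased")
enc = tz_fast(sentence, return_offsets_mapping=True, add_special_tokens=False)
start = enc["offset_mapping"][1][0]  # first character of 'Natural'
end = enc["offset_mapping"][9][1]    # last character of '##EC'
sentence[start:end]
# 'Natural Science Museum of Madrid shows the REC'

Because the offsets index directly into the original string, slicing it back preserves the exact spacing and casing, sidestepping detokenization entirely.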
