I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text has many domain-specific words that I don't want the BERT tokenizer to break into word-pieces. Is there any solution for this? For example:
from transformers import BertTokenizer

tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")
creates tokens like this:
['meta', '##sta', '##sis']
However, I want to keep each whole word as a single token, like this:
['metastasis']
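One way to keep such domain terms intact, assuming the tokenizer comes from the Hugging Face transformers library (which the snippet above suggests), is to register them as added tokens: added tokens are matched before the wordpiece step, so they survive whole. A minimal sketch:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Register domain-specific words so the wordpiece splitter leaves them
# intact. add_tokens() returns how many tokens were actually new.
tokenizer.add_tokens(['metastasis'])

print(tokenizer.tokenize('metastasis'))  # ['metastasis']

If the tokenizer feeds a model, the embedding matrix must be resized afterwards with model.resize_token_embeddings(len(tokenizer)); note that the new embedding rows start out untrained.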
Won't ' '.join(tokens).replace(' ##', '') do? – Molar