How to stop BERT from breaking apart specific words into word-pieces

I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text contains many domain-specific words that I don't want the BERT model to break into word-pieces. Is there any solution for this? For example:

from transformers import BertTokenizer

tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")

This creates tokens like this:

['meta', '##sta', '##sis']

However, I want to keep the whole words as one token, like this:

['metastasis']
Arbour asked 29/5, 2020 at 9:37
Maybe ' '.join([x for x in tokens]).replace(' ##', '') will do? – Molar
Thanks for your answer, but I can't do this, because I want to keep the word-pieces for other (non-specific) words, for example 'extracting': ['extract', '##ing']. – Arbour
You do not usually need this: subword tokenization is very useful for handling OOV words and helps decrease the vocabulary size. Why do you need to add exceptions? – Molar
Please correct me if I am wrong, but in my example the tokens for 'metastasis' are 'meta', '##sta' and '##sis'. However, I want to keep 'metastasis' as one whole token, because it has no relation to 'meta'. – Arbour

You are free to add new tokens to the existing pretrained tokenizer, but then you need to resize your model's token embeddings and fine-tune it with the extended tokenizer, so that the extra tokens get meaningful representations.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size before adding tokens

tokenizer.add_tokens(['whatever', 'underdog'])
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size after adding tokens

If a token already exists in the vocabulary, like 'whatever', it will not be added again, which is why the count below grows by one rather than two.

Output:

30522
30523
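As a quick sanity check, here is a sketch building on the snippet above (using 'metastasis' from the question): an added token is no longer split, while ordinary words still go through word-piece tokenization.

tokenizer.add_tokens(['metastasis'])
print(tokenizer.tokenize('metastasis'))  # ['metastasis'], kept whole
print(tokenizer.tokenize('extracting'))  # ['extract', '##ing'], still split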
Marquesan answered 29/5, 2020 at 17:13

I think that if I use this solution, like

tokenizer.add_tokens(['whatever', 'underdog'])

the vocab size changes. Does this mean I cannot use a pretrained model from transformers, because the embedding size no longer matches?
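You can still use the pretrained model. A minimal sketch of the usual fix, assuming the Hugging Face transformers API: resize the embedding matrix after adding the tokens. The pretrained rows are kept, and only the rows for the new tokens are freshly initialized, so they need fine-tuning to become meaningful.

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

tokenizer.add_tokens(['whatever', 'underdog'])

# Grow the embedding matrix to match the enlarged vocabulary;
# existing rows keep their pretrained weights, new rows are random.
model.resize_token_embeddings(len(tokenizer))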

Dolomites answered 23/3, 2021 at 8:47

Based on the discussion here, one way to use my own additional vocabulary of specific words is to modify the first ~1000 lines of the vocab.txt file (the '[unused]' placeholder lines), replacing them with the specific words. For example, I replaced '[unused1]' with 'metastasis' in vocab.txt, and after tokenizing with the modified vocab.txt I got this output:

tokens = tokenizer.tokenize("metastasis")

Output: ['metastasis']
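For reference, a minimal sketch of that vocab.txt edit (file names follow the question; picking '[unused1]' is an assumption, any unused placeholder slot works). Since the vocabulary size stays the same, the pretrained embedding matrix still fits without resizing:

# Read the original vocabulary, one token per line.
with open('bert-base-uncased-vocab.txt', encoding='utf-8') as f:
    vocab = f.read().splitlines()

# Overwrite an '[unusedN]' placeholder with the domain-specific word.
vocab[vocab.index('[unused1]')] = 'metastasis'

with open('bert-base-uncased-vocab-modified.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab))

from transformers import BertTokenizer
tokenizer = BertTokenizer('bert-base-uncased-vocab-modified.txt')
print(tokenizer.tokenize('metastasis'))  # ['metastasis']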
Arbour answered 29/5, 2020 at 11:20
Your link to the discussion is broken. – Molar
