BERT training with character embeddings

Does it make sense to change the tokenization paradigm in the BERT model to something else, such as simple word tokenization or character-level tokenization?

Adopted answered 31/3, 2020 at 2:30

That is one motivation behind the paper "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters", where BERT's WordPiece system is discarded and replaced with a CharacterCNN (just like in ELMo). This way, a word-level tokenization can be used without any OOV issues, since the model attends to each token's characters and produces a single embedding for any arbitrary input token.
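
For intuition, here is a minimal sketch, in the spirit of ELMo's character CNN, of how character-level convolutions can map an arbitrary word to a single fixed-size embedding. Everything in it (character vocabulary size, filter widths, dimensions) is an illustrative assumption, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CharCNNWordEmbedder(nn.Module):
    """Toy character-CNN word embedder: maps any word, given as a sequence of
    character ids, to a single fixed-size vector. Hyperparameters are
    illustrative assumptions, not those of the CharacterBERT paper."""

    def __init__(self, n_chars=262, char_dim=16, out_dim=768, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, out_dim // len(kernel_sizes), k) for k in kernel_sizes
        )

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_chars_per_word)
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c)).transpose(1, 2)  # (b*w, char_dim, c)
        # Max-pool each convolution over the character axis, then concatenate.
        feats = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1).view(b, w, -1)  # (batch, n_words, out_dim)

# Any word, even one never seen during training, gets an embedding from its characters.
dummy_chars = torch.randint(1, 262, (1, 5, 12))   # 1 sentence, 5 words, 12 chars each
print(CharCNNWordEmbedder()(dummy_chars).shape)   # torch.Size([1, 5, 768])
```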

Performance-wise, the paper shows that CharacterBERT is generally at least as good as BERT while at the same time being more robust to noisy text.

Ivanivana answered 26/10, 2020 at 14:29

It depends on what your goal is. Using standard word tokens would certainly work, but many words would end up out of vocabulary, which would make the model perform poorly.
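
To make the OOV point concrete, here is a small sketch (assuming the Hugging Face transformers library and the standard bert-base-uncased vocabulary): WordPiece falls back to sub-word pieces for unseen words, while a plain word-level vocabulary can only map them to a single unknown token.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece backs off to sub-word pieces, so no word is ever truly out of vocabulary.
print(tokenizer.tokenize("electroencephalography"))  # e.g. ['electro', '##ence', ...]

# A plain word-level vocabulary has no such fallback: every unseen word collapses to [UNK].
word_vocab = {"the", "cat", "sat"}
sentence = "the cat studied electroencephalography"
print([w if w in word_vocab else "[UNK]" for w in sentence.split()])
# ['the', 'cat', '[UNK]', '[UNK]']
```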

Working entirely at the character level might be interesting from a research perspective: seeing how the model learns to segment text on its own, and how such a segmentation compares to standard tokenization. I am not sure, though, that it would have benefits for practical use. Character sequences are much longer than sub-word sequences, and because BERT's self-attention requires memory quadratic in the sequence length, working on characters would just unnecessarily slow down both training and inference (see the back-of-the-envelope sketch below).
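
A rough back-of-the-envelope sketch of that cost argument (the characters-per-sub-word ratio is an assumption; real values depend on the language and tokenizer):

```python
# Self-attention memory/compute grows with the square of the sequence length.
subword_len = 512        # typical BERT maximum sequence length in sub-word tokens
chars_per_subword = 4    # assumed average; varies with language and tokenizer
char_len = subword_len * chars_per_subword

cost_ratio = (char_len / subword_len) ** 2
print(f"Character input is {chars_per_subword}x longer, "
      f"so attention cost grows ~{cost_ratio:.0f}x.")   # -> ~16x
```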

Aldwin answered 31/3, 2020 at 12:58
