Updating a BERT model through Huggingface transformers

import tensorflow as tf import tensorflow_datasets from transformers import * model = BertModel.from_pretrained('bert-base-uncased') tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') SPECIAL_TOKEN_1="dogs are very cute" SPECIAL_TOKEN_2="dogs are cute but i like cats better and my brother thinks they are more cute" tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2]) model.resize_token_embeddings(len(tokenizer)) #Train our model model.train() model.eval()

BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The most important of those two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities - RoBERTa for example is only pre-trained on MLM).

If you want to further train the model on your own dataset, you can do so by using BERTForMaskedLM in the Transformers repository. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:

from transformers import BertTokenizer, BertForMaskedLM 
import torch   

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True) 

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt") 
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels) 
loss = outputs.loss 
logits = outputs.logits

You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and test file.

You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PretrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.

Recommended topics

Hot tags