I am trying to update the pre-trained BERT model using an in-house corpus. I have looked at the Hugging Face Transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.
The code below is taken directly from the Hugging Face docs. I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the already pre-trained BERT model, but I would like to update or retrain it using another in-house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
SPECIAL_TOKEN_1="dogs are very cute"
SPECIAL_TOKEN_2="dogs are cute but i like cats better and my brother thinks they are more cute"
tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
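# Train our model
model.train()  # only switches the model to training mode; no weights are updated by this call
model.eval()   # switches back to evaluation mode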
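For context, the similarity step described in the question could be sketched roughly as follows. This is a minimal sketch, not part of the docs snippet above: it assumes mean pooling over the last hidden states as the sentence embedding (one common choice, not something the question specifies), it uses the unmodified `bert-base-uncased` checkpoint, and the helper names `mean_pool`, `cosine_similarity`, and `embed_sentences` are hypothetical. It does not perform any retraining.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; cosine distance is 1 minus this."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def embed_sentences(sentences, model_name="bert-base-uncased"):
    """Return one mean-pooled vector per sentence (needs torch and transformers)."""
    # Imported lazily so the pure-Python helpers above work without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()  # evaluation mode: disables dropout
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch)[0]  # last hidden states: (batch, tokens, hidden_size)
    # Assumption: mean pooling over non-padding tokens gives the sentence vector.
    return [
        mean_pool(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]
```

Calling `embed_sentences(["dogs are very cute", "i like cats better"])` and passing the two returned vectors to `cosine_similarity` would give the similarity score; the same helpers would apply unchanged to a model that has been further trained on the in-house corpus.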