How does masked_lm_labels argument work in BertForMaskedLM?
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)

loss, prediction_scores = outputs[:2] 

This code is from the Hugging Face Transformers documentation: https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm

I cannot understand the masked_lm_labels=input_ids argument in the model call. How does it work? Does it mean that the model will automatically mask some of the text when input_ids is passed?

Ere answered 28/4, 2020 at 0:34

The first argument is the masked input; the masked_lm_labels argument is the desired output.
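For example (not part of the original answer, just an illustrative sketch), you can mask one token by hand and pass the original ids as the labels. This assumes the same transformers 2.x API as in the question, where the argument is still called masked_lm_labels (in later versions it was renamed to labels):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

ids = tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)
labels = torch.tensor([ids])  # desired output: the original token ids
ids[-2] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)  # mask "cute"
input_ids = torch.tensor([ids])  # masked input actually fed to the model

outputs = model(input_ids, masked_lm_labels=labels)
loss, prediction_scores = outputs[:2]
# Note: here the loss is still computed over every position; see below
# for how to restrict it to the masked positions with -100 labels.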

The input_ids should already be masked: in general, it is up to you how you do the masking. In the original BERT, 15% of the tokens are chosen, and each of them is handled in one of the following ways:

  • Replace it with the [MASK] token (80% of the time); or
  • Replace it with a random token (10% of the time); or
  • Keep the original token unchanged (10% of the time).

This modifies the input, so you need to tell the model what the original, non-masked input was; that is the masked_lm_labels argument. Note also that you want to compute the loss only for the tokens that were actually chosen for masking: in the labels, every other position should be set to the index -100, which the loss function ignores. A sketch of the whole recipe is below.
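As an illustration (again not from the original answer), here is a minimal sketch of that recipe, adapted from the masking logic in Hugging Face's DataCollatorForLanguageModeling and assuming the transformers 2.x call signature with masked_lm_labels:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

inputs = torch.tensor(
    tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)
).unsqueeze(0)
labels = inputs.clone()

# Pick 15% of the tokens for prediction, never the special tokens.
probability_matrix = torch.full(labels.shape, 0.15)
special_tokens_mask = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(),
                                      already_has_special_tokens=True),
    dtype=torch.bool).unsqueeze(0)
probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
masked_indices = torch.bernoulli(probability_matrix).bool()

# Compute the loss only on the chosen positions: everything else gets -100.
labels[~masked_indices] = -100

# 80% of the chosen tokens become [MASK].
indices_replaced = (torch.bernoulli(torch.full(labels.shape, 0.8)).bool()
                    & masked_indices)
inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

# Half of the rest (10% overall) become a random token; the remaining
# 10% keep their original token.
indices_random = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                  & masked_indices & ~indices_replaced)
inputs[indices_random] = torch.randint(len(tokenizer), labels.shape,
                                       dtype=torch.long)[indices_random]

outputs = model(inputs, masked_lm_labels=labels)
loss, prediction_scores = outputs[:2]

With the labels set up like this, the loss in outputs[0] reflects only the masked positions.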

For more details, see the documentation.

Cherice answered 28/4, 2020 at 7:36
