How can I get RoBERTa word embeddings?

Given a sentence such as 'Roberta is a heavily optimized version of BERT.', I need to get the embeddings for each word in the sentence with RoBERTa. I have looked at sample code online but could not find a definitive answer.

My take is the following:

# 'roberta' is a pre-trained fairseq RoBERTa model (e.g. loaded via torch.hub)
sentence = 'Roberta is a heavily optimized version of BERT.'
tokens = roberta.encode(sentence)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
embedding = all_layers[0]
n = embedding.size()[1] - 1
embedding = embedding[:, 1:n, :]

where embedding[:, 1:n, :] keeps only the embeddings for the words in the sentence, dropping the start and end tokens.

Is it correct?

Acea asked 24/3, 2020 at 3:33
I'm assuming you're using huggingface's library for this? If so, please update your tags accordingly (bert is unused, but you can use huggingface-transformers instead). Since there are several implementations, it is otherwise hard to determine the right implementation and give a correct answer. – Thurifer
I don't think it's correct. My understanding is embedding = all_layers[-1][-1] – Apostate
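For reference, here is a minimal sketch of what the second comment suggests, assuming the roberta.base checkpoint loaded through fairseq's torch.hub interface: take the last element of the list returned by extract_features (the final layer) rather than the first (the embedding layer), then drop the <s> and </s> positions.

import torch

# load a pre-trained fairseq RoBERTa model
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

sentence = 'Roberta is a heavily optimized version of BERT.'
tokens = roberta.encode(sentence)  # adds <s> and </s> automatically

# list with one tensor per layer, each of shape [batch, seq_len, hidden]
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)

last_layer = all_layers[-1]               # final transformer layer
word_embeddings = last_layer[:, 1:-1, :]  # drop <s> and </s>

The answers below take the HuggingFace transformers route instead.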
from transformers import AutoTokenizer

TOKENIZER_PATH = "../input/roberta-transformers-pytorch/roberta-base"
ROBERTA_PATH = "../input/roberta-transformers-pytorch/roberta-base"

text = "How are you? I am good."
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

# how the words are broken into tokens
print(tokenizer.tokenize(text))

# the format of an encoding
print(tokenizer.batch_encode_plus([text]))

# the OP wants the input ids
print(tokenizer.batch_encode_plus([text])['input_ids'])

# the OP wants the input ids without the first and last (special) tokens
print(tokenizer.batch_encode_plus([text])['input_ids'][0][1:-1])

Output:

['How', 'Ġare', 'Ġyou', '?', 'ĠI', 'Ġam', 'Ġgood', '.']

{'input_ids': [[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

[[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]]

[6179, 32, 47, 116, 38, 524, 205, 4]

And don't worry about the "Ġ" character. It just indicates that the token is preceded by a space in the original text.
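Since the question asks for embeddings rather than just input ids, here is a minimal sketch of feeding this encoding to the model; it assumes ROBERTA_PATH above points at a local copy of roberta-base and uses the standard last_hidden_state output of transformers models:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(ROBERTA_PATH)
model.eval()

encoded = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# shape [1, seq_len, hidden_size]; positions 1:-1 correspond to the tokens above
token_embeddings = output.last_hidden_state
print(token_embeddings[0, 1:-1].shape)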

Hindrance answered 26/7, 2021 at 12:00

To get word embeddings from RoBERTa, you can average the embeddings of the subwords (as produced by the tokenizer) that make up the word of interest. There are other pooling strategies as well.

Remember that RoBERTa (pre-trained with masked language modelling) produces context-sensitive embeddings, and using it without fine-tuning to produce embeddings for individual words (without any context) may perform poorly on a downstream task.

Something like this should work:

import numpy as np
import torch


def get_hidden_states(encoded, token_ids_word, model, layers):
    # run the model without gradient tracking
    with torch.no_grad():
        output = model(**encoded)
    # hidden states of every layer (requires output_hidden_states=True on the model)
    states = output.hidden_states
    # stack the selected layers and sum them
    output = torch.stack([states[i] for i in layers]).sum(0).squeeze()
    # keep only the positions that belong to the subwords of the word
    word_tokens_output = output[token_ids_word]
    # average over the subwords to get a single word vector
    return word_tokens_output.mean(dim=0)


def get_word_vector(sent, idx, tokenizer, model, layers, device):
    # tokenize the sentence and move the tensors to the target device
    encoded = tokenizer.encode_plus(sent, add_special_tokens=True, return_tensors="pt").to(device)
    # token positions that belong to the word at position idx
    token_ids_word = np.where(np.array(encoded.word_ids()) == idx)
    return get_hidden_states(encoded, token_ids_word, model, layers)


def get_embedding(model, tokenizer, sent, word, device, layers=None):
    # use the last four hidden layers by default
    layers = [-4, -3, -2, -1] if layers is None else layers
    # position of the word in the whitespace-split sentence
    idx = sent.split(" ").index(word)
    return get_word_vector(sent, idx, tokenizer, model, layers, device)
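A usage sketch, assuming a roberta-base model and tokenizer from HuggingFace; note that the model must be loaded with output_hidden_states=True for output.hidden_states above to be populated, and word_ids() requires a fast tokenizer:

from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True).to(device)
model.eval()

sent = "Roberta is a heavily optimized version of BERT."
word_embedding = get_embedding(model, tokenizer, sent, "optimized", device)
print(word_embedding.shape)  # torch.Size([768]) for roberta-base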
Achlamydeous answered 7/11, 2023 at 22:51
