Warning:
You need to check whether the produced sentence embeddings are meaningful. This is necessary because the model you are using wasn't trained to produce meaningful sentence embeddings (check this StackOverflow answer for further information).
Retrieving sentence embeddings from LLMs is an ongoing research topic. In the following, I will show two different approaches that can be used to retrieve sentence embeddings from Llama 2.
Weighted-Mean-Pooling
Llama is a decoder-only model with left-to-right (causal) attention. The idea behind weighted mean pooling is that tokens at the end of the sentence should contribute more to the sentence embedding than tokens at the beginning, because their hidden states are contextualized by all previous tokens, while the tokens at the beginning have far less context.
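To make the weighting concrete, here is a minimal sketch (my own illustration, not part of the approach itself) of how the position weights look for a toy attention mask; the full Llama 2 example follows below:

import torch

# toy attention mask: first sequence has 3 real tokens, second has 5
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# positions 1..seq_len, zeroed out at padding positions
weights = attention_mask * torch.arange(start=1, end=attention_mask.shape[1] + 1).unsqueeze(0)
print(weights)
# tensor([[1, 2, 3, 0, 0],
#         [1, 2, 3, 4, 5]])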
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"

t = AutoTokenizer.from_pretrained(model_id)
t.pad_token = t.eos_token  # Llama has no pad token, reuse the EOS token for padding
m = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
m.eval()

texts = [
    "this is a test",
    "this is another test case with a different length",
]

t_input = t(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    last_hidden_state = m(**t_input, output_hidden_states=True).hidden_states[-1]

# position-based weights (1, 2, ..., seq_len), zeroed out for padding tokens
weights_for_non_padding = t_input.attention_mask * torch.arange(start=1, end=last_hidden_state.shape[1] + 1).unsqueeze(0)

# weighted sum of the token states, normalized by the sum of the weights
sum_embeddings = torch.sum(last_hidden_state * weights_for_non_padding.unsqueeze(-1), dim=1)
num_of_none_padding_tokens = torch.sum(weights_for_non_padding, dim=-1).unsqueeze(-1)
sentence_embeddings = sum_embeddings / num_of_none_padding_tokens

print(t_input.input_ids)
print(weights_for_non_padding)
print(num_of_none_padding_tokens)
print(sentence_embeddings.shape)
Output:
tensor([[ 1, 445, 338, 263, 1243, 2, 2, 2, 2, 2],
[ 1, 445, 338, 1790, 1243, 1206, 411, 263, 1422, 3309]])
tensor([[ 1, 2, 3, 4, 5, 0, 0, 0, 0, 0],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
tensor([[15],
[55]])
torch.Size([2, 4096])
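To run the kind of sanity check mentioned in the warning above, you could, for example, compare the two embeddings with cosine similarity. A minimal sketch, assuming sentence_embeddings from the snippet above is still in scope:

import torch.nn.functional as F

# cosine similarity between the two weighted-mean-pooled sentence embeddings
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(similarity.item())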
Prompt-based last token
An alternative is to use a specific prompt and take the contextualized embedding of the last token. This approach was introduced by Jiang et al. and showed decent results for the OPT model family without fine-tuning. The idea is to use a prompt that forces the model to summarize the sentence in exactly one word. They call this approach PromptEOL and used the following template for their experiments:
"This sentence: {text} means in one word:"
Please check their paper for further results. You can use the following code to apply their approach to Llama 2:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"

t = AutoTokenizer.from_pretrained(model_id)
t.pad_token = t.eos_token  # Llama has no pad token, reuse the EOS token for padding
m = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
m.eval()

texts = [
    "this is a test",
    "this is another test case with a different length",
]

# wrap every sentence in the PromptEOL template
prompt_template = "This sentence: {text} means in one word:"
texts = [prompt_template.format(text=x) for x in texts]

t_input = t(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    last_hidden_state = m(**t_input, output_hidden_states=True, return_dict=True).hidden_states[-1]

# index of the last non-padding token of each sequence (right padding)
idx_of_the_last_non_padding_token = t_input.attention_mask.bool().sum(1) - 1

# take the hidden state of that token as the sentence embedding
sentence_embeddings = last_hidden_state[torch.arange(last_hidden_state.shape[0]), idx_of_the_last_non_padding_token]

print(idx_of_the_last_non_padding_token)
print(sentence_embeddings.shape)
Output:
tensor([12, 17])
torch.Size([2, 4096])
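As with the pooling approach, you should sanity-check these embeddings before relying on them. A minimal sketch that normalizes the embeddings and prints the pairwise cosine similarity matrix, assuming sentence_embeddings from the snippet above is still in scope:

import torch.nn.functional as F

# L2-normalize so the dot product equals the cosine similarity
normalized = F.normalize(sentence_embeddings, p=2.0, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)  # shape: [2, 2]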