Speed up embedding of 2M sentences with RoBERTa
I have roughly 2 million sentences that I want to turn into vectors using Facebook AI's RoBERTa-large, fine-tuned on NLI and STSb for sentence similarity (using the awesome sentence-transformers package).

I already have a dataframe with two columns: "utterance", containing each sentence from the corpus, and "report", containing, for each sentence, the title of the document it comes from.

From there, my code is the following:

import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

print("Embedding sentences")

data = pd.read_csv("data/sentences.csv")

sentences = data['utterance'].tolist()

sentence_embeddings = []

for sent in tqdm(sentences):
    embedding = model.encode([sent])
    sentence_embeddings.append(embedding[0])

data['vector'] = sentence_embeddings

Right now, tqdm estimates that the whole process will take around 160 hours on my computer, which is more than I can spare.

Is there any way I could speed this up by changing my code? Is creating a huge list in memory then appending it to the dataframe the best way to proceed here? (I suspect not).

Many thanks in advance!

Underpinnings answered 4/5, 2020 at 8:50 Comment(3)
The only reasonable speed-up will come from getting a better GPU. – Bleacher
@Paul Miller Can you give the answer to your question, i.e., the full solution? You marked Christine_NLP's as a good answer, but we also need the full solution. What is your code now, and how exactly did you speed it up? I can't figure it out from Christine's answer. Thanks. – Chavers
@TedoVrbanec I've expanded the code below. – Pedicular
I found a dramatic speedup with this package by feeding in the utterances as a list instead of looping over the list one sentence at a time. I assume there is some efficient internal batching going on.

%timeit utterances_enc = model.encode(utterances[:10])
3.07 s ± 53.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit utterances_enc = [model.encode(utt) for utt in utterances[:10]]
4min 1s ± 8.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

The full code would be as follows:

import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

print("Embedding sentences")

data = pd.read_csv("data/sentences.csv")

sentences = data['utterance'].tolist()

# encode() batches the whole list internally (batch_size defaults to 32)
sentence_embeddings = model.encode(sentences, show_progress_bar=True)

# encode() returns a 2-D array; wrap it in list() so pandas stores one
# vector per row instead of rejecting the 2-D assignment
data['vector'] = list(sentence_embeddings)
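For 2 million sentences you may also want to avoid building one giant in-memory list before touching the dataframe. A minimal sketch of chunked encoding (the `encode_in_chunks` helper and the `chunk_size` value are my own illustration, not part of sentence-transformers; you would pass `model.encode` as `encode_fn`):

```python
import numpy as np

def encode_in_chunks(encode_fn, sentences, chunk_size=10_000):
    """Encode sentences in fixed-size chunks and stack the results into
    one (n_sentences, dim) array, so the Python-side working set stays
    bounded by one chunk at a time."""
    parts = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        parts.append(np.asarray(encode_fn(chunk)))
        # An np.save() per chunk here would also make the run resumable
        # if it dies partway through.
    return np.vstack(parts)
```

With the real model this would be `embeddings = encode_in_chunks(model.encode, sentences)`, after which `data['vector'] = list(embeddings)` attaches one vector per row.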
Pedicular answered 5/5, 2020 at 11:38 Comment(3)
Which package are you referring to? – Feculent
@Pedicular What is the package name, and can you give a full example based on the question's code? Thanks! – Chavers
That is very useful information! Thank you :) – Mccrea
If you want a more efficient way to encode with the same model, you can convert the sentence-transformers model to an ONNX model for ONNX Runtime, or to a plan (engine) model for TensorRT. The sentence-transformers author does not provide a conversion script, but I found a tutorial that shows the conversion steps: quick_sentence_transformers

Delgadillo answered 1/3, 2022 at 10:27 Comment(0)