Universal Sentence Encoder for big document similarity

I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents.

After analyzing a few approaches, I got very good results with the Universal Sentence Encoder from Google. The problem is that my documents can be very long. For these very long texts the performance seems to decrease, so my idea was to cut the text into sentences/paragraphs.

So I ended up with a list of vectors for each document (one vector per part of the document).

My question is: is there a state-of-the-art algorithm/methodology to compute a score from a list of vectors? I don't really want to merge them into a single vector, as that would create the same effect as before (the relevant part would be diluted in the document). Is there a scoring algorithm to combine the multiple cosine similarities between the query and the different parts of the text?

Important information: I can have both short and long texts, so a document can have anywhere from 1 to 10 vectors.
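For reference, the chunk-and-embed setup described above can look roughly like the sketch below (assumptions: the TF Hub USE v4 model and a naive blank-line paragraph splitter; any sentence/paragraph splitter would do):

    import tensorflow_hub as hub

    # Load the Universal Sentence Encoder (v4) from TF Hub.
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    def chunk_document(text):
        # Naive splitter: cut on blank lines (paragraphs).
        # Swap in any sentence/paragraph splitter you prefer.
        parts = [p.strip() for p in text.split("\n\n") if p.strip()]
        return parts or [text]

    def embed_document(text):
        # Return one 512-dimensional vector per chunk of the document.
        chunks = chunk_document(text)
        return embed(chunks).numpy()  # shape: (n_chunks, 512)

    # A long document becomes a list of vectors, a short one a single vector.
    doc_vectors = embed_document("First paragraph...\n\nSecond paragraph...")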

Susanasusanetta answered 23/12, 2019 at 17:06

Comments (2):
  - Why not just test which works better: max or average? – Kinin
  - Did you figure out the answer? I am doing similar stuff and have large docs with 22-45 paragraphs, with around 10K docs in the collection. At present, I am unable to figure out whether I should go with USE or not. – Rima

One way of doing this is to embed all sentences of all documents, typically storing them in an index such as FAISS or Elasticsearch. Store the document identifier of each sentence as well: in Elasticsearch this can be metadata, but in FAISS it needs to be held in an external mapping. Then:

  1. Embed the query.
  2. Calculate the cosine similarity between the query and all sentence embeddings.
  3. For the top-k results, group by document identifier and take the sum (this step is optional depending on whether you're looking for the most similar document or the most similar sentence; here I assume you are looking for the most similar document, thereby boosting documents with higher overall similarity).

Then you should have an ordered list of relevant document identifiers.
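As a rough sketch of steps 1-3 with FAISS (assumptions on my side: a doc_ids list kept outside the index, k=50, and sum aggregation; swap the sum for max or mean if that works better for your data):

    import numpy as np
    import faiss  # pip install faiss-cpu

    def build_index(sentence_vectors, doc_ids):
        # sentence_vectors: (n_sentences, d) array; doc_ids maps each row
        # to the identifier of the document the sentence came from.
        vecs = np.ascontiguousarray(sentence_vectors, dtype="float32")
        faiss.normalize_L2(vecs)          # inner product == cosine similarity
        index = faiss.IndexFlatIP(vecs.shape[1])
        index.add(vecs)
        return index, list(doc_ids)

    def search(index, doc_ids, query_vector, k=50):
        # Return documents ordered by the summed cosine similarity
        # of their top-k matching sentences.
        q = np.ascontiguousarray(query_vector.reshape(1, -1), dtype="float32")
        faiss.normalize_L2(q)
        scores, idx = index.search(q, k)  # top-k sentence hits
        doc_scores = {}
        for score, i in zip(scores[0], idx[0]):
            if i == -1:                   # fewer than k sentences indexed
                continue
            doc_scores[doc_ids[i]] = doc_scores.get(doc_ids[i], 0.0) + float(score)
        return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

The same grouping logic applies if you keep everything in Elasticsearch; only the nearest-neighbour search step changes.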

Wildman answered 11/10, 2021 at 8:05
