Is it possible to use Google BERT to calculate similarity between two textual documents?

Is it possible to use Google BERT to calculate similarity between two textual documents? As I understand it, BERT's input is limited to sentence-sized sequences with a fixed maximum length. Some works use BERT for sentence-level similarity, for example:

https://github.com/AndriyMulyar/semantic-text-similarity

https://github.com/beekbin/bert-cosine-sim

Is there an implementation of BERT that can take large documents (documents with thousands of words) as input instead of sentences?

Somerset answered 11/9, 2019 at 5:3 Comment(1)
Yes, use Sentence-BERT: average each document's sentence embeddings, then take the cosine similarity between the document vectors. – Jerryjerrybuild
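A minimal sketch of the approach that comment describes, using the sentence-transformers package (the model name and the naive sentence splitting are assumptions):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def doc_embedding(doc):
    # Naive sentence split; use nltk or spacy for real documents.
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    # One embedding per sentence, averaged into a single document vector.
    return model.encode(sentences).mean(axis=0)

doc1 = "First document. It has several sentences."
doc2 = "Second document. Also made of sentences."
print(util.cos_sim(doc_embedding(doc1), doc_embedding(doc2)))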

BERT is not trained to determine if one sentence follows another. That is just ONE of the GLUE tasks and there are a myriad more. ALL of the GLUE tasks (and superglue) are getting knocked out of the park by ALBERT.

BERT (and ALBERT, for that matter) is the absolute state of the art in natural language understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bi-directional attention-based encoder built on the Transformer, the incarnation of the Google Brain paper Attention Is All You Need. Also see this visual breakdown of the Transformer model.

This is a fundamentally new way of looking at natural language that doesn't use RNNs or LSTMs or tf-idf or any of that stuff. We aren't turning words or docs into fixed vectors anymore. GloVe (Global Vectors for Word Representation) with LSTMs is old. Doc2Vec is old.

BERT is really powerful - like, pass-the-Turing-test-easily powerful. Take a look at SuperGLUE, which just came out. Scroll to the bottom and look at how hard those tasks are. THAT is where NLP is at.

Okay, so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks, in two layers:

  1. Perform either extractive or abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (notice how big the documents it handles are) and reduce each document down to a summary.

  2. In a separate step, take each pair of summaries and run the STS-B task from page 3 of the GLUE paper; a rough sketch of both steps follows below.
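
A rough sketch of that two-step pipeline, using bert-extractive-summarizer for step 1 and a pretrained sentence-transformers model standing in for an ALBERT STS-B fine-tune in step 2 (the model names are assumptions and the summarization ratio is just an example value, not the exact setup described above):

from summarizer import Summarizer                     # pip install bert-extractive-summarizer
from sentence_transformers import SentenceTransformer, util

summarizer = Summarizer()                             # BERT-based extractive summarizer
sts_model = SentenceTransformer('all-MiniLM-L6-v2')   # stand-in for an STS-B fine-tuned model

def document_similarity(doc_a, doc_b):
    # Step 1: reduce each long document to a short summary.
    summary_a = summarizer(doc_a, ratio=0.2)
    summary_b = summarizer(doc_b, ratio=0.2)
    # Step 2: score the two summaries for semantic similarity.
    emb_a = sts_model.encode(summary_a)
    emb_b = sts_model.encode(summary_b)
    return float(util.cos_sim(emb_a, emb_b))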

Now, we are talking about absolutely bleeding-edge technology here (ALBERT came out only in the last few months). You will need to be quite proficient to get through this, but it CAN be done, and I believe in you!!

Galvanize answered 4/2, 2020 at 21:57 Comment(2)
One of BERT's training objectives is indeed to predict whether one sentence follows another in a document. Is there any evaluation of your proposed method of summarization -> sentence similarity? – Otalgia
That's an excellent answer. I have a comment, though: instead of running the STS-B task, which would require re-running that little monster on every summarized text, would some off-the-shelf clustering technique on the Euclidean distances between BERT's final embedding outputs also work? I was trying that recently and I think I got some pretty decent results. – Jaal

BERT is a sentence representation model. It is trained to predict words in a sentence and to decide whether two sentences follow each other in a document, i.e., it operates strictly at the sentence level. Moreover, BERT requires memory quadratic in the input length, which would not be feasible for whole documents.

It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT: average the [CLS] vectors that BERT produces for each sentence in a document.
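
A minimal sketch of that averaging idea with the Hugging Face transformers library (the model name and the naive sentence splitting are placeholders):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def document_vector(document):
    # Naive sentence split; use a proper sentence splitter in practice.
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    cls_vectors = []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, return_tensors='pt',
                               truncation=True, max_length=512)
            outputs = model(**inputs)
            # [CLS] is the first token of the last hidden layer.
            cls_vectors.append(outputs.last_hidden_state[:, 0, :])
    return torch.cat(cls_vectors).mean(dim=0)

sim = torch.nn.functional.cosine_similarity(
    document_vector("Some long document. Many sentences."),
    document_vector("Another document. Also quite long."), dim=0)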

There are some document-level embeddings; for instance, doc2vec is a commonly used option.
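
For example, a small doc2vec setup with gensim (the corpus and hyperparameters below are placeholders):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["first long document text ...", "second long document text ..."]
tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=40)

# Infer a vector for each (possibly unseen) document and compare them.
vec_a = model.infer_vector(corpus[0].lower().split())
vec_b = model.infer_vector(corpus[1].lower().split())
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))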

As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to the state of the art, so I would not hesitate to use them. Or at least it is worth trying them to see how they compare to embeddings.
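
For instance, a quick tf-idf baseline with scikit-learn (the documents are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first long document ...", "second long document ...", "third one ..."]

# Each row is one document's tf-idf vector.
tfidf = TfidfVectorizer(stop_words='english').fit_transform(documents)

# Pairwise cosine similarity between all documents.
print(cosine_similarity(tfidf))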

Colincolinson answered 11/9, 2019 at 12:30 Comment(0)

To add to @jindřich's answer: BERT is trained to fill in missing words in a sentence and to predict the next sentence. Word-embedding-based doc2vec is still a good way to measure similarity between documents. If you want to delve deeper into why the best model isn't always the best choice for a use case, give this post a read; it clearly explains why not every state-of-the-art model is suitable for every task.

Pantalets answered 5/12, 2020 at 16:35 Comment(0)

Yeah, you would just do each part independently. For summarization you hardly need to do much: just look on PyPI for "summarize" and you'll find several packages; you don't even need to train anything. For sentence-to-sentence similarity there is a fairly involved method for computing the loss, but it's spelled out on the GLUE website; it's considered part of the challenge (meeting the metric). Determining that distance (STS) is non-trivial, and I think they call it "coherence", but I'm not sure.

Galvanize answered 23/3, 2020 at 9:3 Comment(0)

Use a sentence transformer for this: https://www.sbert.net/

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

#Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

#Compute cosine similarity between the two embeddings
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

You can also store embeddings in a vector database like Milvus and then run queries on the document embeddings.

Forewent answered 24/1 at 16:11 Comment(0)
