Spacy, Strange similarity between two sentences

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

This returns a very strange value:

0.9066019751888448

These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?

Creator answered 31/8, 2018 at 10:55 Comment(0)

The spaCy documentation for vector similarity explains the basic idea:
Each word has a vector representation, learned from the contexts it appears in (Word2Vec-style embeddings trained on large corpora), as explained in the documentation.

Now, the vector of a full sentence is simply the average over all the word vectors. If you have many words that semantically lie in the same region (for example, filler words like "he", "was", "this", ...) and the remaining vocabulary "cancels out", you might end up with a similarity like the one you are seeing.
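
A quick way to check the averaging yourself (a minimal sketch; it assumes en_core_web_lg is installed and uses static word vectors, and the variable names are only illustrative):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("This was very strange argument between american and british person")

# For pipelines with static word vectors, doc.vector is the average of the token vectors
manual_average = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_average))  # expected: True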

The question is rightfully what you can do about it. From my perspective, you could come up with a more complex similarity measure. Since search_doc and main_doc carry additional information, such as the original sentence, you could penalize the vectors for a difference in length, or alternatively compare shorter pieces of the sentences and compute pairwise similarities (then again, the question would be which parts to compare).
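
As one hedged sketch of the pairwise idea (reusing search_doc and main_doc from the question; choosing noun chunks as the "shorter pieces" and max/mean as the aggregation is an assumption, not something the documentation prescribes):

# Compare noun chunks pairwise instead of the whole sentences
search_chunks = list(search_doc.noun_chunks)
main_chunks = list(main_doc.noun_chunks)

pair_scores = [a.similarity(b) for a in search_chunks for b in main_chunks]
print(max(pair_scores))                     # similarity of the best-matching pair
print(sum(pair_scores) / len(pair_scores))  # average over all pairs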

For now, there is no clean way to simply resolve this issue, sadly.

Smoking answered 3/9, 2018 at 5:40 Comment(1)
The clean way is either to have more meaningful vector representations or to judge similarity by meaningful words only (see the answer below). – Gine

spaCy constructs the sentence embedding by averaging the word embeddings. Since an ordinary sentence contains a lot of relatively meaningless words (called stop words), you get poor results. You can remove them like this:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

# Re-parse the sentences with the stop words filtered out
search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

Alternatively, keep only nouns and proper nouns, since they carry the most information:

doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
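
Applied to the two sentences from the question, this looks roughly like the following (a sketch; the variable names are only illustrative):

search_doc_nouns = nlp(' '.join([str(t) for t in search_doc if t.pos_ in ['NOUN', 'PROPN']]))
main_doc_nouns = nlp(' '.join([str(t) for t in main_doc if t.pos_ in ['NOUN', 'PROPN']]))
print(search_doc_nouns.similarity(main_doc_nouns))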
Batholomew answered 6/1, 2019 at 21:1 Comment(1)
Reading this and other answers cleared up the misconception I had that stop words were removed in document similarity. This particular answer is great because it focuses on actual content, cutting down on noise words and making the similarity calculation faster. – Hedges

As noted by others, you may want to use the Universal Sentence Encoder or InferSent.

For the Universal Sentence Encoder, you can install pre-built spaCy models that handle the wrapping of TF Hub for you: you just install the package with pip, and the vectors and similarity will work as expected.

You can follow the instructions in this repository (I am the author): https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub

  1. Install the model: pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.3/en_use_md-0.4.3.tar.gz#en_use_md-0.4.3

  2. Load and use the model

import spacy
# this loads the wrapper
nlp = spacy.load('en_use_md')

# your sentences
search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))
# this will print 0.310783598221594

Assyria answered 22/4, 2020 at 13:33 Comment(6)
Please disclose that you are the author of the package mentioned (although it's quite obvious). – Cradling
Thanks @Cradling, I added the mention. – Assyria
When using a model like this, should I still remove stop words, or do they get used as part of the necessary context? – Bayne
@Bayne you don't need to remove stop words or lemmatize. The Universal Sentence Encoder can process your unprocessed text directly. – Assyria
Hi @MartinoMensio, thanks for the response. How can I load 'en_use_md' from disk? My server is not connected to the Internet. Is there any workaround? 1. Which file to download, and then how to spacy.load... ? Thank you. – Jibber
A late answer to @Droid-Bird: spacy.io/usage/saving-loading. Basically, spacy package install just downloads the binary data to disk in the site-packages dir. You can download the content of this dir manually: github.com/MartinoMensio/spacy-universal-sentence-encoder/tree/…, copy it to the server, and use nlp.from_disk("/path"). – Guyot

The Universal Sentence Encoder is now available in the official spaCy Universe: https://spacy.io/universe/project/spacy-universal-sentence-encoder

1. Installation:

pip install spacy-universal-sentence-encoder

2. Code example:

import spacy_universal_sentence_encoder
# load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
Corduroy answered 27/11, 2022 at 19:29 Comment(1)
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review – Proofread

As pointed out by @dennlinger, spaCy's sentence embedding is just the average of all the individual word vector embeddings. So if you have a sentence with opposing words like "good" and "bad", their vectors might cancel each other out, resulting in poor contextual embeddings. If your use case specifically requires sentence embeddings, you should try out the SOTA approaches below:

  1. Google's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/2

  2. Facebook's Infersent Encoder: https://github.com/facebookresearch/InferSent

I have tried both of these embeddings; they give good results to start with most of the time, and they use word embeddings as a base for building sentence embeddings.
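
For reference, a minimal sketch of using the Universal Sentence Encoder directly via TF Hub (it assumes TensorFlow 2.x and the tensorflow_hub package, and it loads the v4 TF2 SavedModel rather than the v2 module linked above):

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (TF2 SavedModel format)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "This was very strange argument between american and british person",
    "He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.",
]
vectors = np.array(embed(sentences))

# Cosine similarity between the two sentence embeddings
cosine = np.dot(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(cosine)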

Cheers!

Emmanuelemmeline answered 14/9, 2019 at 6:4 Comment(0)
