Text similarity using Word2Vec

I would like to use Word2Vec to check similarity of texts.

I am currently using a different approach:

from fuzzywuzzy import fuzz

def sim(name, dataset):
    # fuzz.ratio returns an integer score from 0 to 100, so threshold at 50 rather than 0.5
    matches = dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 50, axis=1)
    return matches

(name is the value from my Text column).

For applying this function I do the following:

df['Sim'] = df.apply(lambda row: sim(row['Text'], df), axis=1)

Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset?

Example of dataset:

Text
Hello, this is Peter, what would you need me to help you with today? 
I need you
Good Morning, John here, are you calling regarding your cell phone bill? 
Hi, this is John. What can I do for you?
...

The first text and the last one are quite similar, although they use different words to express a similar concept. I would like to create a new column where, for each row, I put the texts that are similar to it. I hope you can help me.

Priebe asked 22/1, 2021 at 21:0 Comment(3)
You can use a pre-trained word embedding model (word2vec, GloVe or fastText) to get word embeddings. These can be added (vector addition) to represent sentences. The similarity between these vectors can then be calculated using cosine similarity. Do check my answer, which elaborates on that as well as the example code. There are other ways of combining word embeddings as well. Plus you can directly use Doc2Vec to represent a sentence as a vector. – Subsidy
Why do you want to use Word2Vec to compare sentences? Word2Vec is tailored for word embeddings, not sentence embeddings. Why not use Doc2Vec, or even better: sentence transformers? – Celibacy
Hi @RJAdriaansen, I am open to other possibilities that are not word2vec. Since I am getting wrong results using fuzzy matching, I was thinking of Word2Vec to get better ones. My goal is to show the similarity of the sentences I mentioned in my question. Thank you both for your comments and help, Akshay Sehgal and RJ Adriaansen. – Priebe

TL;DR: skip to the last section (part 4) for the code implementation.

1. Fuzzy vs Word embeddings

Unlike a fuzzy match, which is basically edit distance (Levenshtein distance) matching strings at the character level, word2vec (and other models such as fastText and GloVe) represents each word in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.
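
To make the contrast concrete, here is a minimal sketch using fuzzywuzzy (the example strings are mine; exact scores can vary slightly depending on the backend):

from fuzzywuzzy import fuzz

# Character-level matching scores surface similarity, not meaning
print(fuzz.ratio("good morning", "good mourning"))  # high (~96): one letter apart, different meaning
print(fuzz.ratio("hello", "hi"))                    # low (~29): similar intent, few shared characters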

These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed to create a sentence embedding. Sentences with words with similar semantics will have similar word vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.


Let's say I have a sentence with two words. Word2Vec will represent each word here as a vector in some Euclidean space. Summing them, just like standard vector addition, results in another vector in the same space. This can be a good choice for representing a sentence using its individual word embeddings.

NOTE: There are other methods of combining word embeddings, such as a weighted sum with tf-idf weights (sketched below), or directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
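
To make the weighted-sum idea concrete, here is a minimal sketch with toy 3-dimensional vectors and made-up tf-idf weights (none of these numbers come from a real model):

import numpy as np

# Toy word vectors and hypothetical tf-idf weights, purely for illustration
toy_vectors = {
    'good': np.array([0.9, 0.1, 0.0]),
    'morning': np.array([0.2, 0.8, 0.1]),
}
tfidf_weights = {'good': 0.4, 'morning': 1.6}  # rarer words get higher weight

words = ['good', 'morning']
plain_sum = sum(toy_vectors[w] for w in words)                        # every word counts equally
weighted_sum = sum(tfidf_weights[w] * toy_vectors[w] for w in words)  # rarer words dominate

print(plain_sum)     # [1.1 0.9 0.1]
print(weighted_sum)  # [0.68 1.32 0.16]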

2. Similarity between word vectors / sentence vectors

“You shall know a word by the company it keeps”

Words that occur together (in the same context) are usually similar in semantics/meaning. The great thing about word2vec is that the vectors for words with similar contexts lie closer to each other in the Euclidean space. This lets you do things like clustering, or just simple distance calculations.


A good way to measure how similar two word vectors are is cosine similarity. Read more here.
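
The computation itself is a one-liner; a minimal numpy sketch of cosine similarity:

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1 = same direction, 0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0 (parallel vectors)
print(cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal vectors)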

3. Pre-trained word2vec models (and others)

The awesome thing about word2vec and similar models is that, in most cases, you don't need to train them on your own data. You can use pre-trained word embeddings that have been trained on a ton of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.

You can then check the similarity between sentence embeddings using cosine similarity.

4. Sample code implementation

I use a GloVe model (similar to word2vec) that was pre-trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose models other than the one I used from here: https://github.com/RaRe-Technologies/gensim-data

import numpy as np
from scipy import spatial
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # choose from multiple models: https://github.com/RaRe-Technologies/gensim-data

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    # lowercase and split on whitespace to match the model's vocabulary
    return [i.lower() for i in s.split()]

def get_vector(s):
    # sum the word vectors of all tokens to get a sentence vector
    # (assumes every token is in the vocabulary; model[i] raises KeyError otherwise)
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)

print('s0 vs s1 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))
Output (semantic similarity between sentence pairs):

s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
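
To connect this back to the question's df.apply pattern, a minimal sketch reusing get_vector from above (the most_similar helper and the 0.8 threshold are my own illustrative choices, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'Text': [s0, s1, s2, s3]})

def most_similar(text, dataset, threshold=0.8):
    v = get_vector(text)
    # cosine similarity between this row's text and every other row's text
    sims = dataset['Text'].apply(lambda other: 1 - spatial.distance.cosine(v, get_vector(other)))
    # keep the other texts whose similarity clears the threshold
    return dataset.loc[(sims >= threshold) & (dataset['Text'] != text), 'Text'].tolist()

df['Sim'] = df['Text'].apply(lambda t: most_similar(t, df))
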
Subsidy answered 9/2, 2021 at 21:30 Comment(12)
Thank you so much for your answer, Akshay Sehgal. I have not considered GloVe, but I think it could work as well. I am having some difficulties in implementing the GloVe model in my function. My goal would be to show better results than the ones I got using fuzzy logic. – Priebe
You can replace GloVe with a word2vec model as well. Check the models here: github.com/RaRe-Technologies/gensim-data – Subsidy
@Math, let me know what difficulties you are facing. – Subsidy
Thanks Akshay Sehgal. My difficulties are in replacing word2vec here: dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 50, axis=1), in order to compare each text with another. I think I cannot 'just replace' fuzz.ratio with word2vec, so I am trying to understand and apply the information that you provided in that example to my code (unfortunately still with no success). – Priebe
You are correct when you say you can't use word2vec directly instead of fuzz.ratio. These are two different approaches and measure similarity/distance between two sentences in completely different ways. – Subsidy
The first one is pure edit distance: how many edits do I need to make to change this word into that word. The second has more to do with a context/semantic-based representation of words/sentences, and thus requires cosine similarity. – Subsidy
Let me know if you need any guidance. We can chat on Stack Overflow. – Subsidy
Context/semantic similarity is a different ball game and therefore will have to be tackled separately. – Subsidy
I can recommend a few reading materials for your reference. But I do hope that the code example I shared above is relevant to you (check the s0, s1, s2, s3 that I compare). – Subsidy
You can always come up with a lambda function to do this. Let's chat so that I can get a better understanding of what you are trying to achieve. – Subsidy
Thanks for your help, Akshay. So my problem is that I am trying to implement word2vec within that function in order to apply a comparison among all the sentences, but it is not so straightforward. Probably word2vec was not a great idea in my case, as igrinis said in his/her answer. However, either with Word2Vec or with Universal Sentence Encoder, I have not been able to include this part in my code. – Priebe
Just to add, word2vec and GloVe are a sufficiently good way to create similarities between sentences as well. It's basic linear algebra: add multiple vectors together to create a final vector in the same space. Yes, it does have some issues, but for your case it should work fine. – Subsidy

If you want to compare sentences you should not use Word2Vec or GloVe embeddings. They translate every word in a sentence into a vector, and it is quite cumbersome to work out how similar two sentences are from the two sets of such vectors. You should use something that is tailored to convert a whole sentence into a single vector. Then you just need to compare how similar two vectors are. Universal Sentence Encoder is one of the best encoders considering the computational cost and accuracy trade-off (the DAN variant). See an example of usage in this post; I believe it describes a use case which is quite close to yours.
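
As a minimal sketch of that approach (assuming tensorflow and tensorflow_hub are installed; the URL is the standard TF Hub location for the v4 DAN encoder):

import numpy as np
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder (DAN variant) from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "Hello, this is Peter, what would you need me to help you with today?",
    "Hi, this is John. What can I do for you?",
]
vectors = embed(sentences).numpy()  # one 512-dimensional vector per sentence

# USE vectors are approximately unit length, so an inner product approximates cosine similarity
print(np.inner(vectors[0], vectors[1]))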

Succuss answered 11/2, 2021 at 17:53 Comment(4)
Hi, thanks for the details; however, a small note: word2vec and GloVe are a sufficiently good way to create similarities between sentences as well. It's basic linear algebra, and it lets you create a semantic representation of a sentence based on its words, in the same vector space. And you can find a billion papers and guides where this is used as a viable approach. Is it the best approach? No, but with limited data, for comparative analysis against approaches like fuzzy matching, Jaccard scores, and other keyword/character-based similarity, it's a viable approach. – Subsidy
As you said, it is definitely not the best way of getting semantic similarity. Using USE, Doc2Vec or SentenceBERT is a better choice (accuracy wise). Considering the availability of the pretrained models, the only question left is the compute power required. – Succuss
Correct, but the OP doesn't have training data to build their own Doc2Vec. They can, however, leverage pretrained word2vec to compare semantic similarities out of the box. – Subsidy
The Universal Sentence Encoder model is ready to be used immediately and does not require training by the user. All one needs to do is call embed() and compare the similarity between two vectors (one for each sentence). – Succuss
