Cosine Similarity between columns of two dataframes of differing lengths?

I have a text column in df1 and a text column in df2. The two dataframes have different lengths. I want to calculate cosine similarity for every entry in df1[text] against every entry in df2[text] and give a score for every match.

sample input

df1                           
mahesh                 
suresh


df2                                                                                  
surendra    
mahesh    
shrivatsa    
suresh    
maheshwari

sample output

mahesh    surendra       30
mahesh    mahesh         100
mahesh    shrivatsa      20
mahesh    suresh         60
mahesh    maheshwari     80
suresh    surendra       70
suresh    mahesh         60
suresh    shrivatsa      40
suresh    suresh         100
suresh    maheshwari     30

I was getting key errors when I tried to match these two columns for similarity using a tf-idf approach, because the columns were of different lengths. Is there any other way to solve this problem? Any help would be greatly appreciated. I have searched a lot and found that in almost all cases people were comparing the first document against the rest of the documents in the same corpus. Here it is a matter of comparing every document of corpus 1 with every document of corpus 2.

Wateriness answered 31/12, 2019 at 10:15 Comment(1)
How are you calculating cosine similarity for strings? - Campstool

There are many different string distance measures. I'm not certain how best to apply cosine similarity in this case, but I suggest looking into the strsim library.

I'll give you an example of how I would approach the problem using the Jaro-Winkler metric, which is well suited to short strings.

I'm also including my attempt at cosine similarity, based on the example in that library's documentation.

It could be completely wrong, but it should give you a general idea of how to build a dataframe from the Cartesian product of two columns of different lengths, and how to apply strsim's algorithms to data stored in a pd.DataFrame.


Data preparation:

import pandas as pd

from similarity.jarowinkler import JaroWinkler
from similarity.cosine import Cosine


df1 = pd.DataFrame({
    "name": ["mahesh", "suresh"]
})

df2 = pd.DataFrame({
    "name": ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]
})

# Cartesian product: one row per (df1 name, df2 name) pair
df = pd.MultiIndex.from_product(
    [df1["name"], df2["name"]], names=["col1", "col2"]
).to_frame(index=False)

returns:

     col1        col2
0  mahesh      mahesh
1  mahesh    surendra
2  mahesh   shrivatsa
3  mahesh      suresh
4  mahesh  maheshwari
5  suresh      mahesh
6  suresh    surendra
7  suresh   shrivatsa
8  suresh      suresh
9  suresh  maheshwari

Jaro-Winkler:

jarowinkler = JaroWinkler()
df["jarowinkler_sim"] = [
    jarowinkler.similarity(i, j) for i, j in zip(df["col1"], df["col2"])
]

returns:

    col1    col2        jarowinkler_sim
0   mahesh  mahesh      1.0
1   mahesh  surendra    0.4305555555555555
2   mahesh  shrivatsa   0.5185185185185185
3   mahesh  suresh      0.6666666666666666
4   mahesh  maheshwari  0.9466666666666667
5   suresh  mahesh      0.6666666666666666
6   suresh  surendra    0.8333333333333334
7   suresh  shrivatsa   0.611111111111111
8   suresh  suresh      1.0
9   suresh  maheshwari  0.48888888888888893


Cosine similarity:

cosine = Cosine(2)  # profiles are built from 2-character shingles
df["p0"] = df["col1"].apply(cosine.get_profile)
df["p1"] = df["col2"].apply(cosine.get_profile)
df["cosine_sim"] = [
    cosine.similarity_profiles(p0, p1) for p0, p1 in zip(df["p0"], df["p1"])
]

# show the result without the intermediate profile columns
df.drop(["p0", "p1"], axis=1)

returns:

    col1    col2        cosine_sim
0   mahesh  mahesh      0.9999999999999998
1   mahesh  surendra    0.0
2   mahesh  shrivatsa   0.15811388300841897
3   mahesh  suresh      0.3999999999999999
4   mahesh  maheshwari  0.7453559924999299
5   suresh  mahesh      0.3999999999999999
6   suresh  surendra    0.5070925528371099
7   suresh  shrivatsa   0.15811388300841897
8   suresh  suresh      0.9999999999999998
9   suresh  maheshwari  0.29814239699997197
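
To make the scores above less of a black box, here is a minimal stdlib-only sketch of cosine similarity over character bigram counts. It is not the library's code; the function names `bigram_profile` and `bigram_cosine` are made up for illustration. It assumes each string is reduced to a count of its overlapping 2-character substrings, which does give 0.4 for mahesh/suresh and 0.0 for mahesh/surendra, in line with the table above.

```python
from collections import Counter
from math import sqrt


def bigram_profile(s, n=2):
    """Count the overlapping n-character substrings of s (hypothetical helper)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def bigram_cosine(a, b, n=2):
    """Cosine similarity between the n-gram count vectors of two strings."""
    p, q = bigram_profile(a, n), bigram_profile(b, n)
    # Dot product over the n-grams of a; a Counter returns 0 for missing keys.
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    # Strings shorter than n have an empty profile; treat that as no similarity.
    return dot / norm if norm else 0.0


for left in ["mahesh", "suresh"]:
    for right in ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]:
        print(left, right, round(100 * bigram_cosine(left, right)))
```

Multiplying by 100 and rounding gives scores in the 0-100 style of the sample output in the question, although the actual numbers there will differ because they were only illustrative.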

Attire answered 31/12, 2019 at 11:14 Comment(2)
Thanks Political Scientist. When I tried df = pd.MultiIndex.from_product([df1["name"], df2["name"]], names=["col1", "col2"]).to_frame(index=False) with huge data (both columns have 50K+ rows), it results in a memory error. Is there an optimized way to do the same for a huge amount of data? - Wateriness
@pythonlearner are there any duplicates? You could try filtering them out. If that doesn't help, you might as well try the itertools.product function to avoid using pandas altogether. - Attire
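
A rough sketch of that last suggestion, assuming plain Python lists instead of dataframes: drop duplicates, build each string's profile once, then stream the pairs with itertools.product and keep only those above a cutoff, so the full cross product is never held in memory. The names `iter_matches` and `threshold` are hypothetical, and the bigram cosine here is a stdlib stand-in for the strsim call used in the answer.

```python
from collections import Counter
from itertools import product
from math import sqrt


def bigram_profile(s, n=2):
    """Count the overlapping n-character substrings of s (hypothetical helper)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def iter_matches(left, right, threshold=0.3):
    """Lazily yield (left, right, score) for pairs scoring at least threshold."""
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # each distinct string gets its profile computed exactly once.
    left_profiles = {s: bigram_profile(s) for s in dict.fromkeys(left)}
    right_profiles = {s: bigram_profile(s) for s in dict.fromkeys(right)}
    # product() walks the pairs one at a time; only the two profile
    # dictionaries are stored, not the len(left) * len(right) result.
    for (a, pa), (b, pb) in product(left_profiles.items(), right_profiles.items()):
        score = cosine(pa, pb)
        if score >= threshold:
            yield a, b, score


names1 = ["mahesh", "suresh", "mahesh"]
names2 = ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]
for a, b, score in iter_matches(names1, names2):
    print(a, b, round(score, 3))
```

This still performs one comparison per pair, so 50K x 50K inputs mean about 2.5 billion comparisons; it only addresses the memory error, not the running time. Writing the yielded rows to a file as they arrive keeps memory flat.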
