how to compare two string variables in pandas?
Asked Answered
A

5

15

I have two string columns in my Pandas dataset

name1     name2
John Doe  John Doe
AleX T    Franz K

and I need to check whether name1 equals name2. The naive way I use now is using a simple mask

mask=df.name1==df.name2

But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.

For instance "John Doe" and "John Doe " would not match. Of course, I trimmed, lower-cased my strings but other possibilities remain.

One idea would be to look whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as argument. Any other ideas?

Many thanks!

EDIT: using isin gives non-sensical results. Example

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})

test
Out[6]: 
           A             B
0   john doe      john doe
1   john doe  eddie murphy
2       John        batman

test['A'].isin(test['B'])
Out[7]: 
0    False
1     True
2    False
Name: A, dtype: bool
Anaclitic answered 11/3, 2016 at 12:57 Comment(6)
Hi @edchum this is not a duplicate of the question you mention.Accumbent
So are you after something like df['name1'].str.lower() == df['name2'].str.lower()?Pauwles
not really, because I believe perfect equality generate too many false negatives in the data. checking whether name1 is contained somewhere in name2 seems more appropriateAccumbent
I think in this case you should strip the spaces and lower the case, see my answerPauwles
If you proposed using something like in then you'd have to do this using apply like df.apply(lambda x: x['A'].strip().lower() in x['B'].strip().lower(), axis=1)Pauwles
.isin() does not give nonsensical results, it does exactly what the doc says; if you do .isin(test['B']), you're comparing each LHS to the entire column test['B'] (the entire series) not just an individual row-by-row comparison (for which you'd have to do an .apply(..., axis=1)) with a lambda fn. (Although, that's a good pandas docbug catch.)Puebla
D
8

I think you can use str.lower and str.replace with arbitrary whitespace s/+:

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'], 
                     'B': [' john doe', 'eddie murphy', 'batman']})

print test['A'].str.lower().str.replace('s/+',"") == 
      test['B'].str.strip().str.replace('s/+',"")


0     True
1    False
2    False
dtype: bool
Dendrite answered 11/3, 2016 at 13:2 Comment(3)
I dont understand why this gives me non-sensical results. please see my updated questionAccumbent
do you have any ideas whats going on with isin here?Accumbent
Sorry, I was offline.Dendrite
T
5

You can use difflib to compute distance

import difflib as dfl
dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()

edit : integration with Pandas :

import pandas as pd
import difflib as dfl
df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)
Toxicity answered 11/3, 2016 at 13:24 Comment(1)
does that integrate with pandas dataframes?Accumbent
P
4

strip the spaces and lower the case:

In [414]:
test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower()

Out[414]:
0     True
1    False
2    False
dtype: bool
Pauwles answered 11/3, 2016 at 13:37 Comment(0)
G
1

What you want is a string distance based on editing effort distance(s1, s2), which is what we call edit distance of strings. Once you define that function in your namespace you can do:

df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']))
filtered = df[df['distance_s'] < eps] # you define eps

From a Google search, the following came up:

https://pypi.python.org/pypi/editdistance

It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient though.

Gaussmeter answered 11/3, 2016 at 13:4 Comment(0)
A
0

Having stumbled upon a similar problem, I discovered that difflib.Sequencematcher ratio is a good measure of similarity between two texts. However, my goal was to carefully inspect the difference between two texts, and there were no convenient solution for it. So I've created a library. Here's how you could use it:

from IPython import display
import pandas as pd
from pandas_text_comparer import TextComparer


# A toy dataset. Replace with your data
df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz")

comparer = TextComparer(df, column_a="llm_response", column_b="human_response")
comparer.run()

html = comparer.get_html()
display.HTML(html)

comparison result

There's more to it, e.g. you can filter the rows, sort them by ratio, add other columns to the view etc. Feel free to read more in the repo. Hope it helps someone!

Andromada answered 22/4 at 15:43 Comment(2)
Please elaborate how this would specifically answer OP's question. Similarity screening is not the request. They want exact match more or less a number of minor -but admittedly ill-defined- differences such as trailing characters. Have you tried with their provided input instead of that repo's own demo dataset?Corporeity
Hey! With the help of this library, OP will be able to quickly spot the ways his name1 and name2 may differ and based on that write a preprocessing function, which should then be applied to both texts.Andromada

© 2022 - 2024 — McMap. All rights reserved.