how to compare two string variables in pandas?

Asked 11/3, 2016 at 12:57 Answered 22/4 at 15:43

I have two string columns in my Pandas dataset

name1     name2
John Doe  John Doe
AleX T    Franz K

and I need to check whether name1 equals name2. The naive way I use now is using a simple mask

mask=df.name1==df.name2

But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.

For instance "John Doe" and "John Doe " would not match. Of course, I trimmed, lower-cased my strings but other possibilities remain.

One idea would be to look whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as argument. Any other ideas?

Many thanks!

EDIT: using isin gives non-sensical results. Example

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})

test
Out[6]: 
           A             B
0   john doe      john doe
1   john doe  eddie murphy
2       John        batman

test['A'].isin(test['B'])
Out[7]: 
0    False
1     True
2    False
Name: A, dtype: bool

Anaclitic answered 11/3, 2016 at 12:57 Comment(6)

Hi @edchum this is not a duplicate of the question you mention. – Accumbent 11/3, 2016 at 13:34

So are you after something like df['name1'].str.lower() == df['name2'].str.lower()? – Pauwles 11/3, 2016 at 13:35

not really, because I believe perfect equality generate too many false negatives in the data. checking whether name1 is contained somewhere in name2 seems more appropriate – Accumbent 11/3, 2016 at 13:38

I think in this case you should strip the spaces and lower the case, see my answer – Pauwles 11/3, 2016 at 13:38

If you proposed using something like in then you'd have to do this using apply like df.apply(lambda x: x['A'].strip().lower() in x['B'].strip().lower(), axis=1) – Pauwles 11/3, 2016 at 13:40

.isin() does not give nonsensical results, it does exactly what the doc says; if you do .isin(test['B']), you're comparing each LHS to the entire column test['B'] (the entire series) not just an individual row-by-row comparison (for which you'd have to do an .apply(..., axis=1)) with a lambda fn. (Although, that's a good pandas docbug catch.) – Puebla 11/2, 2021 at 7:29

I think you can use str.lower and str.replace with arbitrary whitespace s/+:

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'], 
                     'B': [' john doe', 'eddie murphy', 'batman']})

print test['A'].str.lower().str.replace('s/+',"") == 
      test['B'].str.strip().str.replace('s/+',"")


0     True
1    False
2    False
dtype: bool

Dendrite answered 11/3, 2016 at 13:2 Comment(3)

I dont understand why this gives me non-sensical results. please see my updated question – Accumbent 11/3, 2016 at 13:31

do you have any ideas whats going on with isin here? – Accumbent 11/3, 2016 at 13:39

Sorry, I was offline. – Dendrite 11/3, 2016 at 13:41

You can use difflib to compute distance

import difflib as dfl
dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()

edit : integration with Pandas :

import pandas as pd
import difflib as dfl
df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)

Toxicity answered 11/3, 2016 at 13:24 Comment(1)

does that integrate with pandas dataframes? – Accumbent 11/3, 2016 at 13:29

strip the spaces and lower the case:

In [414]:
test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower()

Out[414]:
0     True
1    False
2    False
dtype: bool

Pauwles answered 11/3, 2016 at 13:37 Comment(0)

What you want is a string distance based on editing effort distance(s1, s2), which is what we call edit distance of strings. Once you define that function in your namespace you can do:

df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']))
filtered = df[df['distance_s'] < eps] # you define eps

From a Google search, the following came up:

https://pypi.python.org/pypi/editdistance

It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient though.

Gaussmeter answered 11/3, 2016 at 13:4 Comment(0)

Having stumbled upon a similar problem, I discovered that difflib.Sequencematcher ratio is a good measure of similarity between two texts. However, my goal was to carefully inspect the difference between two texts, and there were no convenient solution for it. So I've created a library. Here's how you could use it:

from IPython import display
import pandas as pd
from pandas_text_comparer import TextComparer


# A toy dataset. Replace with your data
df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz")

comparer = TextComparer(df, column_a="llm_response", column_b="human_response")
comparer.run()

html = comparer.get_html()
display.HTML(html)

There's more to it, e.g. you can filter the rows, sort them by ratio, add other columns to the view etc. Feel free to read more in the repo. Hope it helps someone!

Andromada answered 22/4 at 15:43 Comment(2)

Please elaborate how this would specifically answer OP's question. Similarity screening is not the request. They want exact match more or less a number of minor -but admittedly ill-defined- differences such as trailing characters. Have you tried with their provided input instead of that repo's own demo dataset? – Corporeity 26/4 at 7:10

Hey! With the help of this library, OP will be able to quickly spot the ways his name1 and name2 may differ and based on that write a preprocessing function, which should then be applied to both texts. – Andromada 2/5 at 14:34

Recommended topics

Hot tags