I have two string columns in my Pandas dataset
name1 name2
John Doe John Doe
AleX T Franz K
and I need to check whether name1
equals name2
.
The naive way I use now is using a simple mask
mask=df.name1==df.name2
But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.
For instance "John Doe" and "John Doe " would not match. Of course, I trimmed, lower-cased my strings but other possibilities remain.
One idea would be to look whether name1
is contained in name2
. But it seems I cannot use str.contains
with another variable as argument. Any other ideas?
Many thanks!
EDIT: using isin
gives non-sensical results.
Example
test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
test
Out[6]:
A B
0 john doe john doe
1 john doe eddie murphy
2 John batman
test['A'].isin(test['B'])
Out[7]:
0 False
1 True
2 False
Name: A, dtype: bool
df['name1'].str.lower() == df['name2'].str.lower()
? – Pauwlesstrip
the spaces andlower
the case, see my answer – Pauwlesin
then you'd have to do this usingapply
likedf.apply(lambda x: x['A'].strip().lower() in x['B'].strip().lower(), axis=1)
– Pauwles.isin()
does not give nonsensical results, it does exactly what the doc says; if you do.isin(test['B'])
, you're comparing each LHS to the entire columntest['B']
(the entire series) not just an individual row-by-row comparison (for which you'd have to do an.apply(..., axis=1)
) with a lambda fn. (Although, that's a good pandas docbug catch.) – Puebla