How to use \b word boundary in pandas str.contains?


Is there an equivalent when using str.contains?

The following code mistakenly puts "Said Business School" in the category because of 'Sa'. If I could add a word boundary, it would solve the problem. Putting a space after 'Sa' messes this up. I am using pandas (df is a DataFrame). I know I can use regex, but I'm curious whether I can use plain strings to make it faster.

gprivate_n = ('Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation')
df.loc[df[df.Name.str.contains('{0}'.format(gprivate_n))].index, "Private"] = 1

Jackelinejackelyn answered 12/3, 2014 at 17:57 Comment(3)

Sorry, I am using pandas! – Jackelinejackelyn 12/3, 2014 at 18:0

Just use the regular expression word boundary… – Rittenhouse 12/3, 2014 at 18:0

@poke: need to use r'\b...' (rawstring). Same old issue that arises with regexes. – Lallygag 17/2, 2020 at 21:35


A word boundary is not a character, so you can't find it with .contains. You need to either use regex or split the strings into words and then check for membership of each of those words in the set you currently have defined in gprivate_n.
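For illustration, the split-and-check-membership approach could look roughly like the sketch below. The sample names and the helper has_private_token are made up for the example, and 'Sa' is listed without the trailing space because splitting on whitespace already isolates whole words.

```python
import pandas as pd

# Hypothetical sample data, not from the question
df = pd.DataFrame({"Name": ["Acme Sa", "Said Business School",
                            "Foo Group Ltd", "Media Markt"]})

# The keywords from gprivate_n as a set of whole words
private_words = {"Co", "Inc", "Llc", "Group", "Ltd", "Corp", "Plc", "Sa",
                 "Insurance", "Ag", "As", "Media", "&", "Corporation"}

def has_private_token(name: str) -> bool:
    # True if any whitespace-separated word of the name is in the set
    return not private_words.isdisjoint(name.split())

df["Private"] = df["Name"].apply(has_private_token).astype(int)
print(df)
```

Note that this only matches words delimited by whitespace, so a name like "Acme Ltd." (with trailing punctuation) would be missed unless the punctuation is stripped first.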

Hipparchus answered 12/3, 2014 at 21:5 Comment(3)

Word boundaries can be caught with str.contains, when using \\b instead of \b, and / or raw strings. See link and link. – Ballon 18/3, 2019 at 15:10

@PawelKranzberg: that's actually the same old issue about escaping or raw-string, so use r'\b...' – Lallygag 17/2, 2020 at 21:21

This is factually incorrect: \b can be used with str.contains, you just need raw-string: r'\b...' – Lallygag 17/2, 2020 at 21:37


This is just the same old Python regex issue: in a normal string literal '\b' is a backspace character, not a word boundary, so it should be passed either as a raw string (r'\b...') or, less desirably, with double escaping ('\\b...').
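A quick way to see the difference, using only the standard re module (the test strings are made up for the example):

```python
import re

plain = '\bSa\b'      # '\b' is the backspace character (\x08) here
raw = r'\bSa\b'       # backslash + 'b': the regex word-boundary token
escaped = '\\bSa\\b'  # double-escaped, same pattern as the raw string

print(len(plain), len(raw))   # 4 6
print(raw == escaped)         # True

# The raw pattern matches 'Sa' only as a whole word
print(bool(re.search(raw, "Acme Sa")))               # True
print(bool(re.search(raw, "Said Business School")))  # False

# The plain pattern looks for literal backspace characters, so it finds nothing
print(bool(re.search(plain, "Acme Sa")))             # False
```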

So your regex should be:

gprivate_n = (r'\b(Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation)')
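As a rough sketch of how this could be applied with str.contains, here is a variant that goes a bit further than the pattern above. It is my own assumption, not part of the original question: it adds a closing \b so that 'Co' does not match "Company", drops the trailing space from 'Sa', and checks '&' separately, because \b before '&' only matches when a word character comes directly before it. The sample names are made up.

```python
import pandas as pd

# Hypothetical sample data, not from the question
df = pd.DataFrame({"Name": ["Acme Sa", "Said Business School", "Foo Group Ltd",
                            "Company Store", "Smith & Sons"]})

words = ["Co", "Inc", "Llc", "Group", "Ltd", "Corp", "Plc", "Sa",
         "Insurance", "Ag", "As", "Media", "Corporation"]

# (?:...) is a non-capturing group; pandas may warn about match groups
# when str.contains is given a pattern with capturing parentheses
pattern = r"\b(?:" + "|".join(words) + r")\b"

mask = (df["Name"].str.contains(pattern, regex=True)
        | df["Name"].str.contains("&", regex=False))  # '&' as a plain substring

df["Private"] = 0
df.loc[mask, "Private"] = 1
print(df)
```

The boolean mask can be passed to df.loc directly, so the extra df[...].index step from the question is not needed.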

Lallygag answered 17/2, 2020 at 21:25 Comment(0)

