Tokenize by using regular expressions (parenthesis)

I have the following text:

I don't like to eat Cici's food (it is true)

I need to tokenize it into:

["I", "don't", "like", "to", "eat", "Cici's", "food", "(", "it", "is", "true", ")"]

I have found that the regular expression (['()\w]+|\.) splits it like this:

["I", "don't", "like", "to", "eat", "Cici's", "food", "(it", "is", "true)"]

How do I separate each parenthesis from the adjacent word and make it its own token?

Thanks for ideas.

Fetus answered 29/3, 2017 at 12:2 Comment(18)
Do you plan to split or match these tokens? It might be easier to match them with \w+(?:'\w+)?|[^\w\s]. - Talich
What is the difference between split and match? To sum up the problem, what I need is (foo) -> ["(", "foo", ")"]. - Brewmaster
What is the language? - Talich
The language is English. - Brewmaster
I mean, what programming language are you using the pattern in? - Talich
Programming language: Python. - Brewmaster
Great, then use re.findall(r"\w+(?:'\w+)?|[^\w\s]", s). - Talich
There are some quotation marks missing. Why findall? I need to split the sentence into tokens. - Brewmaster
Sorry, the double-quoted string literal must be used; I edited the comment. It does tokenize the string. Just test it and you will see. \w+(?:'\w+)? will match all chunks of 1+ word chars optionally followed by a ' and 1+ more word chars, and [^\w\s] will match a single char other than word and whitespace characters. - Talich
Well, works fine, thx. So could you tell me which expression I need only for (foo) -> ["(", "foo", ")"]? I'm trying to understand what you have done. - Brewmaster
Only for (foo): re.findall(r'\w+|\W', s) matches 1 or more word chars (\w+), or (|) 1 non-word char (\W). But if you want to avoid matching whitespace (which \W also matches), you need to exclude it from the pattern using [^\w\s]. It is a kind of contrast principle with exceptions. I will post an answer. - Talich
I added two solutions in my answer; if anything is unclear, please let me know. - Talich
Yes, it is not clear what re.findall(r'\w+|\W', s) looks like when avoiding whitespace. - Brewmaster
\W matches whitespace. To subtract \s from \W, you need to convert \W to the negated character class [^\w] (matching any char but a word char) and add \s to it: [^\w\s] matches any char other than word and whitespace chars. - Talich
(foo) with [^\w\s] => ['(', ')'] - Brewmaster
No idea why you used just that, see ideone.com/RZTxmI. Read my answer below. - Talich
What do you mean? It matches (, foo and ). Look here. - Talich
Thx, works fine. - Brewmaster

When you tokenize a string with a regex and the token boundaries depend on context, matching the tokens (rather than splitting on delimiters) usually yields cleaner output, especially because it avoids empty elements in the resulting list.
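To illustrate the difference between splitting and matching, here is a minimal sketch. The sample string and the split pattern are assumptions made up for this comparison, not something from the question; the matching pattern is the simple \w+|[^\w\s] form discussed in the comments.

```python
import re

text = "food (it is true)"

# Splitting: describe the delimiters (whitespace, plus parentheses that are
# captured so they stay in the result). Unused groups yield None, and a
# delimiter at the very end yields an empty string.
split_tokens = re.split(r"\s*([()])\s*|\s+", text)
print(split_tokens)
# ['food', '(', 'it', None, 'is', None, 'true', ')', '']

# The split result needs a clean-up pass to drop None and '' entries.
cleaned = [t for t in split_tokens if t]
print(cleaned)
# ['food', '(', 'it', 'is', 'true', ')']

# Matching: describe the tokens themselves; no clean-up is needed.
match_tokens = re.findall(r"\w+|[^\w\s]", text)
print(match_tokens)
# ['food', '(', 'it', 'is', 'true', ')']
```

With matching, whitespace simply never becomes part of any token, so there is nothing to filter out afterwards.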

Any word character is matched with \w and any non-word character with \W. To tokenize the string into runs of word and non-word characters, you could use the \w+|\W+ regex. In your case, however, you want to match chunks of word characters that are optionally followed by ' and one or more word characters, plus any other single character that is not whitespace.
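A small sketch of why \w+|\W+ is not enough here, using a shortened sample string chosen for illustration: \W also matches whitespace, so spaces leak into the tokens, while [^\w\s] is \W with whitespace excluded.

```python
import re

text = "food (it is true)"

# \W+ grabs whole runs of non-word characters, whitespace included,
# so " (" comes out as one token and lone spaces become tokens too.
coarse = re.findall(r"\w+|\W+", text)
print(coarse)
# ['food', ' (', 'it', ' ', 'is', ' ', 'true', ')']

# [^\w\s] matches one character that is neither a word character nor
# whitespace, so each parenthesis is its own token and spaces are skipped.
fine = re.findall(r"\w+|[^\w\s]", text)
print(fine)
# ['food', '(', 'it', 'is', 'true', ')']
```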

Use

re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

Here, \w+(?:'\w+)? matches words like people or people's, and [^\w\s] matches a single character that is neither a word character nor whitespace.
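The two alternatives can also be written out with re.VERBOSE so that each part carries its own comment. This is only a sketch: the name TOKEN_RE and the test strings are made up for illustration, and the last example shows a case (a trailing apostrophe) where the apostrophe becomes a separate token.

```python
import re

# TOKEN_RE is a hypothetical name; the pattern itself is the one above.
TOKEN_RE = re.compile(
    r"""
    \w+ (?: ' \w+ )?   # word chars, optionally followed by ' and more word chars
    |                  # or
    [^\w\s]            # one character that is neither a word char nor whitespace
    """,
    re.VERBOSE,
)

print(TOKEN_RE.findall("people"))       # ['people']
print(TOKEN_RE.findall("people's"))     # ["people's"]
print(TOKEN_RE.findall("(people's)"))   # ['(', "people's", ')']
print(TOKEN_RE.findall("dogs' bowls"))  # ['dogs', "'", 'bowls']
```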

Python demo:

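import re

# A word with an optional 'suffix, or any single non-word, non-space character
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']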

Another pattern, which treats only ( and ) as separate tokens:

[^()\s]+|[()]

Here, [^()\s]+ matches 1 or more characters other than (, ) and whitespace, and [()] matches either ( or ).
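A quick sketch of this second pattern in Python (the variable names and the second test string are assumptions for illustration). Note that other punctuation stays attached to the neighbouring word, since only the parentheses are singled out.

```python
import re

paren_rx = r"[^()\s]+|[()]"

question_tokens = re.findall(paren_rx, "I don't like to eat Cici's food (it is true)")
print(question_tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

# Commas, exclamation marks, etc. are not split off by this pattern.
punct_tokens = re.findall(paren_rx, "well, yes (really!)")
print(punct_tokens)
# ['well,', 'yes', '(', 'really!', ')']
```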

Pervious answered 29/3, 2017 at 12:57 Comment(0)

Separate the single-character tokens (the parentheses, in this particular case) from the characters that form a token only as a run:

([().]|['\w]+)

Demo: https://regex101.com/r/RQfYhL/2
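For reference, a sketch of how this pattern behaves with re.findall in Python (the test strings are made up for illustration). Because the pattern has a single capturing group, findall returns that group's text. Characters outside both classes, such as commas or question marks, are silently dropped.

```python
import re

rx = r"([().]|['\w]+)"

sentence_tokens = re.findall(rx, "I don't like to eat Cici's food (it is true).")
print(sentence_tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')', '.']

# ',' and '?' belong to neither character class, so they are skipped.
dropped = re.findall(rx, "wait, what?")
print(dropped)
# ['wait', 'what']
```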

Crash answered 29/3, 2017 at 12:4 Comment(0)
