Tokenize using regular expressions (parentheses)

Asked 29/3, 2017 at 12:2 Answered 29/3, 2017 at 12:57

Solved regex string split tokenize

I have the following text:

I don't like to eat Cici's food (it is true)

I need to tokenize it to

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

I have found that the regex (['()\w]+|\.) splits it like this:

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(it', 'is', 'true)']

How do I separate each parenthesis from the adjacent word and make it a token of its own?

Thanks for ideas.

Fetus answered 29/3, 2017 at 12:2 Comment(18)

Do you plan to split or match these tokens? It might be easier to match them with \w+(?:'\w+)?|[^\w\s]. – Talich 29/3, 2017 at 12:4

What is the difference between split and match? To sum up the problem, what I need is (foo) -> ["(", "foo", ")"] – Brewmaster 29/3, 2017 at 12:7

What is the language? – Talich 29/3, 2017 at 12:7

The language is English – Brewmaster 29/3, 2017 at 12:8

I mean what programming language are you using the pattern in? – Talich 29/3, 2017 at 12:19

Programming language: Python – Brewmaster 29/3, 2017 at 12:20

Great, then use re.findall(r"\w+(?:'\w+)?|[^\w\s]", s) – Talich 29/3, 2017 at 12:21

There are some quotation marks missing. Why findall? I need to split the sentence into tokens – Brewmaster 29/3, 2017 at 12:23

Sorry, a double-quoted string literal must be used; I edited the comment. It does tokenize the string. Just test it and you will see. \w+(?:'\w+)? matches a chunk of 1+ word chars optionally followed by ' and 1+ more word chars, and [^\w\s] matches a single char other than a word or whitespace character. – Talich 29/3, 2017 at 12:26

well, works fine thx. So could you tell me which expression i need only for (foo) -> ["(", "foo", ")"]? I'm trying to understand what you have done – Brewmaster 29/3, 2017 at 12:43

Only for (foo) - re.findall(r'\w+|\W', s) - match 1 or more word chars (\w+), or (|) 1 non-word char (\W). But if you plan to avoid matching whitespaces (that can be matched with \W) you need to exclude them from the pattern using [^\w\s]. It is a kind of a contrast principle with exceptions. I will post an answer. – Talich 29/3, 2017 at 12:49

I added two solutions in my answer, if there is anything unclear, please let me know. – Talich 29/3, 2017 at 13:1

Yes, it is not clear what re.findall(r'\w+|\W', s) looks like when it should avoid whitespace – Brewmaster 29/3, 2017 at 13:4

\W matches whitespace. To subtract the \s from \W, you need to convert \W to the negated character class [^\w] (matching any char but a word char) and add \s to it - [^\w\s] that matches any char but a word and whitespace chars. – Talich 29/3, 2017 at 13:5

(foo) with [^\w\s] => ['(', ')'] – Brewmaster 29/3, 2017 at 13:9

No idea why you used just that, see ideone.com/RZTxmI. Read my answer below. – Talich 29/3, 2017 at 13:15

What do you mean? It matches (, foo and ). Look here. – Talich 29/3, 2017 at 13:21

Thx, works fine – Brewmaster 29/3, 2017 at 13:37

When you want to tokenize a string with regex with special restrictions on context, you may use a matching approach that usually yields cleaner output (especially when it comes to empty elements in the resulting list).
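To illustrate the point about empty elements, here is a minimal sketch comparing the two approaches on the (foo) input from the comments. The re.split call with a capturing group is my own choice for the splitting side of the comparison; it is not taken from the question.

```python
import re

s = "(foo)"

# Splitting on a captured delimiter keeps the delimiters, but also yields
# empty strings wherever a delimiter sits at the start or end of the input.
split_tokens = re.split(r"(\W)", s)

# Matching the tokens directly returns only non-empty matches.
match_tokens = re.findall(r"\w+|\W", s)

print(split_tokens)  # ['', '(', 'foo', ')', '']
print(match_tokens)  # ['(', 'foo', ')']
```

With splitting you would have to filter out the empty strings afterwards; with matching there is nothing to clean up.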

Any word character is matched with \w and any non-word character is matched with \W. If you wanted to tokenize the string into runs of word and non-word characters, you could use the \w+|\W+ regex. However, in your case, you want to match chunks of word characters that are optionally followed by ' and 1+ more word characters, plus any other single character that is not whitespace.

Use

re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

Here, \w+(?:'\w+)? matches words like people or people's, and [^\w\s] matches any single character that is neither a word character nor whitespace.

See the regex demo

Python demo:

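import re
# A word with an optional 's-style suffix, or one non-word, non-space char
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']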

Another option that treats only ( and ) as separate tokens:

[^()\s]+|[()]

See the regex demo

Here, [^()\s]+ matches 1 or more symbols other than (, ) and whitespace, and [()] matches either ( or ).
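The two patterns behave differently once other punctuation appears. The sketch below runs both on a sample sentence of my own (not from the question) to show where they diverge; the variable names are only illustrative.

```python
import re

word_punct = r"\w+(?:'\w+)?|[^\w\s]"  # first pattern: every punctuation char is a token
paren_only = r"[^()\s]+|[()]"         # second pattern: only ( and ) are split off

s = "Well, I don't know (yet)!"

tokens_a = re.findall(word_punct, s)
tokens_b = re.findall(paren_only, s)

print(tokens_a)  # ['Well', ',', 'I', "don't", 'know', '(', 'yet', ')', '!']
print(tokens_b)  # ['Well,', 'I', "don't", 'know', '(', 'yet', ')', '!']
```

Note that with the second pattern the comma stays attached to Well, so pick it only if parentheses are the sole punctuation you need as separate tokens.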

Pervious answered 29/3, 2017 at 12:57 Comment(0)

You should separate the single-character tokens (the parentheses in this particular case) from the characters that form a token only as a run:

([().]|['\w]+)

Demo: https://regex101.com/r/RQfYhL/2
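Since the asker is using Python, this pattern could be tried with re.findall as in the rough sketch below. The second input string is my own example, added to show a limitation: characters not listed in either character class (such as , or !) are skipped rather than returned as tokens.

```python
import re

# Pattern from this answer: one of ( ) . as a single token,
# or a run of word characters and apostrophes.
rx = r"([().]|['\w]+)"

s = "I don't like to eat Cici's food (it is true)"
tokens = re.findall(rx, s)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

# Punctuation outside the two classes is silently dropped.
other = re.findall(rx, "yes, really!")
print(other)  # ['yes', 'really']
```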

Crash answered 29/3, 2017 at 12:4 Comment(0)
