I have the following text:
I don't like to eat Cici's food (it is true)
I need to tokenize it into
["i", "don't", "like", "to", "eat", "Cici's", "food", "(", "it", "is", "true", ")"]
I found that the regular expression (['()\w]+|\.)
splits it like this:
["i", "don't", "like", "to", "eat", "Cici's", "food", "(it", "is", "true)"]
How do I split the parentheses off so that each one becomes its own token?
Thanks for ideas.
\w+(?:'\w+)?|[^\w\s]. – Talich
re.findall(r"\w+(?:'\w+)?|[^\w\s]", s) – Talich
\w+(?:'\w+)? will match every chunk of 1+ word chars, optionally followed by a ' and 1+ more word chars, and [^\w\s] will match a single char that is neither a word char nor whitespace. – Talich
(foo) - re.findall(r'\w+|\W', s) - match 1 or more word chars (\w+), or (|) 1 non-word char (\W). But if you want to avoid matching whitespace (which \W also matches), you need to exclude it from the pattern using [^\w\s]. It is a kind of contrast principle with exceptions. I will post an answer. – Talich
\W matches whitespace. To subtract \s from \W, you need to convert \W to the negated character class [^\w] (matching any char but a word char) and add \s to it: [^\w\s] matches any char that is neither a word char nor whitespace. – Talich
(, foo and ). Look here. – Talich
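
For reference, here is a minimal sketch of the pattern suggested in the comments, run against the sentence from the question with Python's re module. The helper name tokenize and the second sample string "(foo) bar" are made up for illustration and do not come from the thread. The last call shows why the comments recommend [^\w\s] over a bare \W: the simpler pattern also returns the space as a token.

```python
import re

# Pattern from the comments: a run of word chars with an optional
# apostrophe part (don't, Cici's), or one char that is neither a
# word char nor whitespace (parentheses, punctuation).
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text (hypothetical helper)."""
    return TOKEN_RE.findall(text)


s = "I don't like to eat Cici's food (it is true)"
tokens = tokenize(s)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

# For contrast: \w+|\W splits the parentheses off too, but it also
# keeps the whitespace between words as a token.
with_spaces = re.findall(r"\w+|\W", "(foo) bar")
print(with_spaces)
# ['(', 'foo', ')', ' ', 'bar']
```

The tokens keep their original case, so call s.lower() first if you need lowercase output.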