F

31

838

I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

Feeder answered 29/6, 2009 at 17:49 Comment(2)

docs.python.org/library/re.html – Septavalent 29/6, 2009 at 18:3

python's str.split() also works with no arguments at all – Handset 8/5, 2018 at 9:4

A

545

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
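The question asks for lowercase words, while the snippet above keeps the original case. A minimal sketch of one way to get the requested output, assuming Python 3; the helper name `words` is made up for illustration and is not part of the original answer:

```python
import re

def words(text):
    """Return the words of text in lowercase.

    The character class [\\w'] keeps apostrophes, so contractions
    such as "don't" stay in one piece.
    """
    return re.findall(r"[\w']+", text.lower())

print(words("Hey, you - what are you doing here!?"))
# ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
print(words("Don't stop"))
# ["don't", 'stop']
```

Lowercasing before matching keeps the pattern unchanged; calling `.lower()` on each match afterwards would work just as well.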

Anoa answered 29/6, 2009 at 17:56 Comment(18)

Thanks. Still interested, though - how can I implement the algorithm used in this module? And why does it not appear in the string module? – Feeder 29/6, 2009 at 18:6

I don't know why the string module doesn't have a multi-character split. Maybe it's considered complex enough to be in the realm of regular expressions. As for "how can I implement the algorithm", I'm not sure what you mean... it's there in the re module - just use it. – Anoa 29/6, 2009 at 20:6

Regular expressions can be daunting at first, but are very powerful. The regular expression '\w+' means "a word character (a-z etc.) repeated one or more times". There's a HOWTO on Python regular expressions here: amk.ca/python/howto/regex – Anoa 4/7, 2009 at 19:44

I got that - I don't mean how to use the re module (it's pretty complicated in itself) but how is it implemented? split() is rather straightforward to program manually, this is much more difficult... – Feeder 6/7, 2009 at 14:32

You want to know how the re module itself works? I can't help you with that I'm afraid - I've never looked at its innards, and my Computer Science degree was a very long time ago. 8-) – Anoa 6/7, 2009 at 15:57

I'm doing my CS1 so I've got a long way to go... It seems very difficult, at first glance, actually, harder than TSP etc. :) – Feeder 8/7, 2009 at 14:41

@Feeder: If you are into CS, you should want to master regex as much as a samurai would want to master a sharp sword. – Overblown 13/11, 2011 at 23:19

The new approach will allow words which contain only the ' character. – Raze 21/5, 2013 at 8:38

This also doesn't handle unicode very well - the apostrophe used above is U+0027, which is the one on en_US keyboards. There is also U+2019, which Unicode says is the preferred apostrophe representation. I often see this character in text pasted from other sources. A regex could be written that looks for punctuation adjacent to whitespace or the beginning or end of a line. I may do that when I get a moment :) – Militarist 12/7, 2013 at 15:24
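One possible way to cover the U+2019 apostrophe mentioned in this comment is to add it to the character class. This is only a sketch, assuming Python 3.3 or later (where the re module accepts `\u` escapes in string patterns); the name `words_unicode` is hypothetical:

```python
import re

# Accept both the ASCII apostrophe (U+0027) and the
# right single quotation mark (U+2019) inside words.
WORD_PATTERN = re.compile(r"[\w'\u2019]+")

def words_unicode(text):
    """Return words, keeping either kind of apostrophe inside them."""
    return WORD_PATTERN.findall(text)

print(words_unicode("Don\u2019t panic, it's fine"))
# ['Don’t', 'panic', "it's", 'fine']
```

Like the original pattern, this still treats a run of apostrophes on its own as a word, so it does not address the earlier comment about words consisting only of the ' character.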

This isn't the answer to the question. This is an answer to a different question, that happens to work for this particular situation. It's as if someone asked "how do I make a left turn" and the top-voted answer was "take the next three right turns." It works for certain intersections, but it doesn't give the needed answer. Ironically, the answer is in re, just not findall. The answer below giving re.split() is superior. – Fracture 9/9, 2013 at 18:47

@JesseDhillon "take all substrings consisting of a sequence of word characters" and "split on all substrings consisting of a sequence of non-word characters" are literally just different ways of expressing the same operation; I'm not sure why you'd call either answer superior. – Soria 13/7, 2015 at 17:51

This is an old post now but it is helping me today. Why the ' edit? I tried it with and without and saw no effect on my Windows 7 machine with Python 2.7. I also do not see that character mentioned in the regex cheat sheets I am working off. What does it do? – Drysalt 21/3, 2017 at 12:40

@TMWP: The apostrophe means that a word like don't is treated as a single word, rather than being split into don and t. – Anoa 21/3, 2017 at 14:48

That explains it. My test sample did not include any contractions so I had nothing inherently in what I was trying to highlight why the ' was there. Thanks for clarifying. Going to change my code now. :-) – Drysalt 21/3, 2017 at 17:36

This solution doesn't work if you want to split by non-white character. – Hera 18/10, 2018 at 18:25

print re.findall(r"[\w\-\_']+", DATA) is more appropriate as it will include the words with hyphen and underscores within them. – Prober 12/11, 2018 at 7:4

@JesseDhillon agreed, I'll use this answer. However, apparently three right turns is the best answer to the other question! ;-) theconversation.com/… and youtu.be/gMRp4RqEsHk – Authoritative 7/7, 2019 at 6:51

@SauravMukherjee not really. re.split(r'[^a-zA-Z0-9-\'_]+', DATA) would be more appropriate. – Bronez 14/6, 2021 at 7:53

H

713

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
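As the first example shows, re.split() leaves an empty string in the result when the input starts or ends with a separator. A small sketch of one way to drop those entries, assuming Python 3 and using a raw string for the pattern; the helper name `split_words` is made up for illustration:

```python
import re

def split_words(text):
    """Split text on runs of non-word characters and drop empty pieces."""
    # The raw string keeps \W intact for the regex engine.
    pieces = re.split(r"\W+", text)
    # Leading or trailing separators produce '' entries; filter them out.
    return [piece for piece in pieces if piece]

print(split_words("Words, words, words."))
# ['Words', 'words', 'words']
print(split_words(" a b c "))
# ['a', 'b', 'c']
```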

Honk answered 29/6, 2009 at 17:57 Comment(7)

This solution has the advantage of being easily adapted to split on underscores too, something the findall solution does not: print re.split("\W+|_", "Testing this_thing") yields: ['Testing', 'this', 'thing'] – Olav 5/1, 2012 at 0:26

A common use case of string splitting is removing empty string entries from the final result. Is it possible to do that with this method? re.split('\W+', ' a b c ') results in ['', 'a', 'b', 'c', ''] – Thessa 6/12, 2017 at 22:38

@ScottMorken I suggest something like [e for e in re.split(r'\W+', ...) if e] ... or possibly first do ' a b c '.strip() – Geryon 8/2, 2018 at 14:39

@ArtOfWarfare It is common to use the shift key to do the opposite of something. ctrl+z undo vs. ctrl+shift+z for redo. So shift w, or W, would be the opposite of w. – Tattan 17/9, 2018 at 16:46

Is this supposed to be r'\W+' (raw strings)? – Sisyphus 28/11, 2018 at 10:39

@ArtOfWarfare Ah, but not always: \a means system bell character, \A means start of string. IKR – Cyanogen 1/4, 2019 at 23:35

This removes any plus and minus signs in front of numbers, which can be undesirable. – Limon 20/10, 2022 at 19:59
