Split Strings into words with multiple word boundary delimiters
F

31

838

I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

Feeder answered 29/6, 2009 at 17:49 Comment(2)
docs.python.org/library/re.htmlSeptavalent
python's str.split() also works with no arguments at allHandset
A
545

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
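
A small Python 3 sketch of what the apostrophe inside the character class changes. The sample sentence is made up for illustration and is not from the question:

```python
import re

# Hypothetical sample text containing contractions.
sample = "Don't panic, it's fine!"

# Without the apostrophe in the class, contractions are split in two.
without_apostrophe = re.findall(r"\w+", sample)

# With it, "Don't" and "it's" each stay in one piece.
with_apostrophe = re.findall(r"[\w']+", sample)

print(without_apostrophe)  # ['Don', 't', 'panic', 'it', 's', 'fine']
print(with_apostrophe)     # ["Don't", 'panic', "it's", 'fine']
```

Note that this class only covers the ASCII apostrophe (U+0027); text using the typographic apostrophe (U+2019) would need that character added to the class as well.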
Anoa answered 29/6, 2009 at 17:56 Comment(18)
Thanks. Still interested, though - how can I implement the algorithm used in this module? And why does it not appear in the string module?Feeder
I don't know why the string module doesn't have a multi-character split. Maybe it's considered complex enough to be in the realm of regular expressions. As for "how can I implement the algorithm", I'm not sure what you mean... it's there in the re module - just use it.Anoa
Regular expressions can be daunting at first, but are very powerful. The regular expression '\w+' means "a word character (a-z etc.) repeated one or more times". There's a HOWTO on Python regular expressions here: amk.ca/python/howto/regexAnoa
I got that - I don't mean how to use the re module (it's pretty complicated in itself) but how is it implemented? split() is rather straightforward to program manually, this is much more difficult...Feeder
You want to know how the re module itself works? I can't help you with that I'm afraid - I've never looked at its innards, and my Computer Science degree was a very long time ago. 8-)Anoa
I'm doing my CS1 so I've got a long way to go... It seems very difficult, at first glance, actually, harder than TSP etc. :)Feeder
@Feeder : If you are into CS, so you should want to master regex as much as a samurai would want to master a sharp sword.Overblown
The new approach will allow words which contains only ' char.Raze
This also doesn't handle unicode very well - the apostrophe used above is U+0027, which is the one on en_US keyboards. There is also U+2019, which Unicode says is the preferred apostrophe representation. I often see this character in text pasted from other sources. A regex could be written that looks for punctuation adjacent to whitespace or the beginning or end of a line. I may do that when I get a moment :)Militarist
This isn't the answer to the question. This is an answer to a different question, that happens to work for this particular situation. It's as if someone asked "how do I make a left turn" and the top-voted answer was "take the next three right turns." It works for certain intersections, but it doesn't give the needed answer. Ironically, the answer is in re, just not findall. The answer below giving re.split() is superior.Fracture
@JesseDhillon "take all substrings consisting of a sequence of word characters" and "split on all substrings consisting of a sequence of non-word characters" are literally just different ways of expressing the same operation; I'm not sure why you'd call either answer superior.Soria
This is an old post now but it is helping me today. Why the ' edit? I tried it with and without and saw no effect my windows 7 machine with Python 2.7. I also do not see that character mentioned in the regex cheat sheets I am working off. What does it do?Drysalt
@TMWP: The apostrophe means that a word like don't is treated as a single word, rather than being split into don and t.Anoa
That explains it. My test sample did not include any contractions so I had nothing inherently in what I was trying to highlight why the ' was there. Thanks for clarifying. Going to change my code now. :-)Drysalt
This solution doesn't work if you want to split by non-white character.Hera
print re.findall(r"[\w\-\_']+", DATA) is more appropriate as it will include the words with hyphen and underscores within them.Prober
@JesseDhillon agreed, I'll use this answer. However, apparently three right turns is the best answer to the other question! ;-) theconversation.com/… and youtu.be/gMRp4RqEsHkAuthoritative
@SauravMukherjee not really. re.split(r'[^a-zA-Z0-9-\'_]+', DATA) would be more appropriate.Bronez
H
713

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
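
The trailing empty string in the first example appears because the text ends with a separator. A minimal Python 3 sketch of one way to drop such empty pieces (the variable names are illustrative only):

```python
import re

text = "Words, words, words."

# re.split leaves an empty string when the text starts or ends with a separator.
raw = re.split(r"\W+", text)

# Keep only the non-empty pieces.
words = [piece for piece in raw if piece]

print(raw)    # ['Words', 'words', 'words', '']
print(words)  # ['Words', 'words', 'words']
```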
Honk answered 29/6, 2009 at 17:57 Comment(7)
This solution has the advantage of being easily adapted to split on underscores too, something the findall solution does not offer: print re.split("\W+|_", "Testing this_thing") yields: ['Testing', 'this', 'thing']Olav
A common use case of string splitting is removing empty string entries from the final result. Is it possible to do that with this method? re.split('\W+', ' a b c ') results in ['', 'a', 'b', 'c', '']Thessa
@ScottMorken I suggest st. like [ e for e in re.split(r'\W+', ...) if e ] ... or possibly first do ' a b c '.strip()Geryon
@ArtOfWarfare It is common to use the shift key to do the opposite of something. ctrl+z undo vs. ctrl+shift+z for redo. So shift w, or W, would be the opposite of w.Tattan
Is this supposed to be r'\W+' (raw strings)?Sisyphus
@ArtOfWarfare Ah, but not always: \a means system bell character, \A means start of line. IKRCyanogen
This removes any plus and minus signs in front of numbers, which can be undesirable.Limon
S
511

Another quick way to do this without a regexp is to replace the characters first, as below:

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']
Shaving answered 27/8, 2011 at 16:10 Comment(8)
Quick and dirty but perfect for my case (my separators were a small, known set)Enounce
Perfect for the case where you don't have access to the RE library, such as certain small microcontrollers. :-)Bicameral
I think this is more explicit than RE as well, so it's kind of noob friendly. Sometimes don't need general solution to everythingChokeberry
Awesome. I had a .split() in a multiple input situation, and needed to catch when the user, me, separated the inputs with a space and not a comma. I was about to give up and recast with re, but your .replace() solution hit the nail on the head. Thanks.Ugrian
it will get you wrong answer when you don't want to split on spaces and you want to split on other characters.Ambulant
This is indeed tedious, but also quite slow, as the whole string is gone through for each new character to be removed. Regular expressions are really not hard for this case (see, e.g., my answer) and are arguably meant to handle this situation (they are both fast and concise and, I would say, legible in this case).Cellulitis
Much clearer than a regex. Plus, I don't really feel like importing a whole module just to perform a single, seemingly simple operation.Shiah
Pretty clever and nice solution. Might not be the most 'elegant' way to do it, but it requires no additional imports and will work with most similar cases, so in a way, it is actually pretty elegant and beautiful too.Abiding
C
406

So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for (splitting on multiple possible separators; many answers instead split on anything that is not a word, which is different). So here is an answer to the question in the title, which relies on Python's standard and efficient re module:

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> filter(None, re.split(r"[, \-!?:]+", "Hey, you-what are you doing here!?"))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

  • the […] matches one of the separators listed inside,
  • the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
  • the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched single-character separators),
  • using a raw string r"…" makes it explicit that \ in the string is to be kept as is (and does not introduce a special character); this matters in Python 3.12+, where an invalid escape sequence such as "\-" in a normal string triggers a SyntaxWarning, and
  • filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).

This re.split() precisely "splits with multiple separators", as asked for in the question title.

This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74's answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!
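
For Python 3, where filter() returns an iterator, the same idea might be sketched as below. The multi-character separators "--" and ":=" and the helper name split_on_separators are assumptions added for illustration; they are listed as alternatives before the character class so that they are tried first:

```python
import re

# Single-character separators go in the class; multi-character ones
# ("--" and ":=" here, chosen only for illustration) are alternatives.
SEPARATORS = re.compile(r"--|:=|[, \-!?:]+")

def split_on_separators(text):
    """Split text on the separators above, dropping empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece]

print(split_on_separators("Hey, you-what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
print(split_on_separators("x:=1--y:=2"))
# ['x', '1', 'y', '2']
```

The list comprehension plays the role of filter(None, …) and returns a list directly.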

Cellulitis answered 18/5, 2014 at 9:43 Comment(11)
"I can't find any solution that does efficiently what the title of the questions literally asks" - second answer does that, posted 5 years ago: https://mcmap.net/q/53672/-split-strings-into-words-with-multiple-word-boundary-delimiters.Backtrack
This answer does not split at delimiters (from a set of multiple delimiters): it instead splits at anything that's not alphanumeric. That said, I agree that the intent of the original poster is probably to keep only the words, instead of removing some punctuation marks.Cellulitis
The irony here is the reason this answer is not getting the most votes ... there are technically correct answers & then there is what the original requester is looking for (what they mean rather than what they say). This is a great answer and I've copied it for when I need it. And yet, for me, the top rated answer solves a problem that is very like what the poster was working on, quickly, cleanly and w/ minimal code. If a single answer had posted both solutions, I would have voted 4 that. Which 1 is better depends on what u r actually trying to do (not the "how-to" quest being asked). :-)Drysalt
@EOL I am trying to split on either > or < or =, whichever comes first in the passed string. using filter(None, re.split(">|<", feature_name)) but my output is <filter at 0x1ec49493f98> any advise on how to actually have the stringLauralee
You must be using Python 3, where filter() constructs an iterator and not a list. You can reproduce Python 2's behavior by wrapping the expression with list().Cellulitis
Plays nicely with Pandas split string method - cuteQuadrennium
@EricLebigot, what if the delimiter consists of a sequence of characters eg "--" (2 dashes), or ":=" ?Unbridled
… then you can simply list the strings that match by separating them with "pipes": "--|:=|[…]+".Cellulitis
@TWMP: agreed, but the inappropriateness of the question to the OP's actual problem at hand is a fault of the question, not the answers. The technically correct answer should be upvoted, on the basis that it best answers the question that people who come here via searching are looking for.Indigotin
lst = [s for s in re.split(r"[, \-!?:]", "Hey, you-what are you doing here!?") if s]Barbera
This works too, indeed, and might be more legible, for some. +1!Cellulitis
G
61

Another way, without regex

import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if not o in punc]).split()
Geaghan answered 21/7, 2009 at 6:2 Comment(5)
This solution is actually better than the accepted one. It works with no ASCII chars, try "Hey, you - what are you doing here María!?". The accepted solution will not work with the previous example.Kaleighkalends
I think there is a small issue here ... Your code will append characters that are separated with punctuation and thus won't split them ... If I'm not wrong, your last line should be: ''.join([o if not o in string.punctuation else ' ' for o in s]).split()Lenhard
The regular expression library can be made to accept Unicode conventions for characters if necessary. Additionally, this has the same problem the accepted solution used to have: as it is now, it splits on apostrophes. You may want o for o in s if (o not in string.punctuation or o == "'"), but then it's getting too complicated for a one-liner if we add in cedbeu's patch also.Suellensuelo
There is another issue here. Even when we take into account the changes of @cedbeu, this code doesn't work if the string is something like "First Name,Last Name,Street Address,City,State,Zip Code" and we want to split only on a comma ,. Desired output would be: ['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code'] What we get instead:['First', 'Name', 'Last', 'Name', 'Street', 'Address', 'City', 'State', 'Zip', 'Code']Freitas
This solution is terribly inefficient: first the string is deconstructed into individual characters, then the whole set of punctuation characters is gone through for each single character in the original string, then the characters are assembled back, and then split again. All this "movement" is very complicated, too, compared to a regular expression-based solution: even if speed does not matter in a given application, there is no need for a complicated solution. Since the re module is standard and gives both legibility and speed, I don't see why it should be eschewed.Cellulitis
B
42

Pro-Tip: Use string.translate for the fastest string operations Python has.

Some proof...

First, the slow way (sorry pprzemek):

>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
... 
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552

Next, we use re.findall() (as given by the suggested answer). MUCH faster:

>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094

Finally, we use translate:

>>> from string import translate,maketrans,punctuation 
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934

Explanation:

string.translate is implemented in C and unlike many string manipulation functions in Python, string.translate does not produce a new string. So it's about as fast as you can get for string substitution.

It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces. A one-for-one substitute. Again, no new data is produced. So this is fast!

Next, we use good old split(). split() by default will operate on all whitespace characters, grouping them together for the split. The result will be the list of words that you want. And this approach is almost 4x faster than re.findall()!
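
The session above is Python 2 (string.translate and string.maketrans were removed in Python 3). A rough Python 3 equivalent of the same technique, using str.maketrans and str.translate, could look like this. It is a sketch rather than a re-run of the benchmark, so no timings are claimed:

```python
import string

S = "Hey, you - what are you doing here!?"

# Map every ASCII punctuation character to a space (str.maketrans is the
# Python 3 counterpart of Python 2's string.maketrans).
TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

# translate() returns a new string; split() then collapses runs of whitespace.
words = S.translate(TABLE).split()

print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```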

Brachial answered 30/8, 2012 at 4:5 Comment(5)
I made a test here, and if you need to use unicode, using patt = re.compile(ur'\w+', re.UNICODE); patt.findall(S) is faster than translate, because you must encode the string before applying transform, and decode each item in the list after the split to go back to unicode.Donets
You can one-liner the translate implementation and ensure that S isn't among the splitters with: s.translate(''.join([(chr(i) if chr(i) not in seps else seps[0]) for i in range(256)])).split(seps[0])Dilks
None taken. You're comparing apples and oranges. ;) my solution in python 3 still works ;P and has support for multi-char separators. :) try doing that in simple manner without allocating a new string. :) but true, mine is limited to parsing command line params and not a book for example.Handbarrow
you say "does not produce a new string", meaning it works inplace on given string? I tested it now with python 2.7 and it does not modify oroginal string and returns new one.Lovett
string.translate and string.maketrans are not available in Python 3 but only in Python 2.Debouch
H
30

I had a similar dilemma and didn't want to use 're' module.

def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
['1111', '', '2222', '3333', '4444', '5555', '6666']
Handbarrow answered 26/5, 2010 at 9:31 Comment(0)
D
16

First, I want to agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add ideas that I considered with that in mind.

My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).

Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.

Option 1 - re.sub

I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.

import re

my_str = "Hey, you - what are you doing here!?"

words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())

In this solution, I nested the call to re.sub(...) inside re.split(...) — but if performance is critical, compiling the regex outside could be beneficial — for my use case, the difference wasn't significant, so I prefer simplicity and readability.
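
If the patterns are reused many times, compiling them once up front might look like the sketch below. The names PUNCTUATION, WHITESPACE and split_words are illustrative and not part of the original answer:

```python
import re

# Compile both patterns once, outside any loop that reuses them.
PUNCTUATION = re.compile(r"[,\-!?]")
WHITESPACE = re.compile(r"\s+")

def split_words(text):
    """Replace the listed punctuation with spaces, then split on whitespace."""
    return WHITESPACE.split(PUNCTUATION.sub(" ", text).strip())

print(split_words("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```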

Option 2 - str.replace

This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
for r in replacements:
    my_str = my_str.replace(r, ' ')

words = my_str.split()

It would have been nice to be able to map the str.replace to the string instead, but I don't think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)

Option 3 - functools.reduce

(In Python 2, reduce is available in global namespace without importing it from functools.)

import functools

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
words = my_str.split()
Dragone answered 10/11, 2016 at 17:31 Comment(2)
Hm, one another method is to use str.translate - it is not unicode-capable but is most likely faster than other methods and as such might be good in some cases: replacements=',-!?'; import string; my_str = my_str.translate(string.maketrans(replacements, ' ' * len(replacements))) Also here it is mandatory to have replacements as a string of characters, not tuple or list.Obcordate
@Obcordate Thanks! I mentioned that one at the top of the answer but decided not to add it since existing answers already discussed it well.Dragone
S
10
join = lambda x: sum(x,[])  # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
# ...alternatively...
join = lambda lists: [x for l in lists for x in l]

Then this becomes a three-liner:

fragments = [text]
for token in tokens:
    fragments = join(f.split(token) for f in fragments)

Explanation

This is what in Haskell is known as the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example in Haskell, say you map the python range(n) -> [0,1,...,n-1] function over a List. If the result is a List, it will be appended to the List in place, so you'd get something like map(range, [3,4,1]) -> [0,1,2,0,1,2,3,0]. This is known as map-append (or mappend, or maybe something like that). The idea here is that you've got this operation you're applying (splitting on a token), and whenever you do that, you join the result into the list.

You can abstract this into a function and have tokens=string.punctuation by default.
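
One possible shape for such a function, as a Python 3 sketch. The name split_on_tokens and the final step that drops empty fragments are assumptions added here, not part of the three-liner above:

```python
import string

def split_on_tokens(text, tokens=string.punctuation):
    """Split text on each token in turn, flattening after every pass."""
    fragments = [text]
    for token in tokens:
        # Split every fragment on this token and flatten the results.
        fragments = [piece for f in fragments for piece in f.split(token)]
    # Empty pieces appear wherever two tokens were adjacent; drop them.
    return [f for f in fragments if f]

print(split_on_tokens("a;bcd,ef--g", [";", ",", "--"]))
# ['a', 'bcd', 'ef', 'g']
print(split_on_tokens("one,two;three"))
# ['one', 'two', 'three']
```

With the default argument, iterating over string.punctuation yields one single-character token at a time; passing a list allows multi-character tokens such as "--".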

Advantages of this approach:

  • This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
  • You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the "tokens" could be a function which splits according to how nested parentheses are.
Sports answered 5/5, 2011 at 8:35 Comment(5)
Neat Haskell solution, but IMO this can be written more clearly without mappend in Python.Floatplane
@Goose: the point was that the 2-line function map_then_append can be used to make a problem a 2-liner, as well as many other problems much easier to write. Most of the other solutions use the regular expression re module, which isn't python. But I have been unhappy with how I make my answer seem inelegant and bloaty when it's really concise... I'm going to edit it...Sports
is this supposed to be working in Python as-written? my fragments result is just a list of the characters in the string (including the tokens).Ermina
@RickTeachey: it works for me in both python2 and python3.Sports
hmmmm. Maybe the example is a bit ambiguous. I have tried the code in the answer all sorts of different ways- including having fragments = ['the,string'], fragments = 'the,string', or fragments = list('the,string') and none of them are producing the right output.Ermina
S
9

I like re, but here is my solution without it:

from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is a method used by 'in' operator. Basically it is the same as

lambda ch: ch in sep

but is more convenient here.

groupby gets our string and function. It splits string in groups using that function: whenever a value of function changes - a new group is generated. So, sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is a result of our function and pair[1] is a group. Using 'if not k' we filter out groups with separators (because a result of sep.__contains__ is True on separators). Well, that's all - now we have a sequence of groups where each one is a word (group is actually an iterable so we use join to convert it to string).

This solution is quite general, because it uses a function to separate string (you can split by any condition you need). Also, it doesn't create intermediate strings/lists (you can remove join and the expression will become lazy, since each group is an iterator)
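
A Python 3 sketch of the lazy variant hinted at above, written as a generator. The function name iter_words is hypothetical:

```python
from itertools import groupby

sep = " ,-!?"
s = "Hey, you - what are you doing here!?"

def iter_words(text, separators):
    """Lazily yield maximal runs of characters that are not separators."""
    for is_separator, group in groupby(text, separators.__contains__):
        if not is_separator:
            yield "".join(group)

words = list(iter_words(s, sep))
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Each word is only built when the caller asks for the next item, so a long text can be processed without holding the full list of words in memory.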

Spense answered 6/10, 2013 at 17:30 Comment(0)
B
6

Use replace two times:

a = '11223FROM33344INTO33222FROM3344'
a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')

results in:

['11223', '33344', '33222', '3344']
Blockade answered 30/3, 2012 at 13:27 Comment(0)
R
5

try this:

import re

phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print(matches)

this will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Rao answered 29/6, 2009 at 18:1 Comment(0)
T
4

In Python 3, you can use the method from PY4E - Python for Everybody.

We can solve both these problems by using the string methods lower, punctuation, and translate. The translate is the most subtle of the methods. Here is the documentation for translate:

your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))

Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.

You can see the "punctuation":

In [10]: import string

In [11]: string.punctuation
Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'  

For your example:

In [12]: your_str = "Hey, you - what are you doing here!?"

In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))

In [14]: line = line.lower()

In [15]: words = line.split()

In [16]: print(words)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
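
One caveat worth illustrating: deleting punctuation joins the parts of a hyphenated word. A hedged sketch of an alternative that maps each punctuation character to a space instead (the sample sentence and variable names are chosen for illustration):

```python
import string

text = "There was a big cave-in"

# Deleting punctuation glues the two halves of a hyphenated word together.
deleted = text.translate(str.maketrans("", "", string.punctuation))

# Mapping each punctuation character to a space keeps them apart.
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = text.translate(to_spaces)

print(deleted.lower().split())  # ['there', 'was', 'a', 'big', 'cavein']
print(spaced.lower().split())   # ['there', 'was', 'a', 'big', 'cave', 'in']
```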

For more information, you can refer:

Transilient answered 15/7, 2018 at 15:9 Comment(2)
The translate() and maketrans() methods of strings are interesting, but this method fails to "split at delimiters" (or whitespace): for example, "There was a big cave-in" will incorrectly produce the word "cavein" instead of the expected "cave" and "in"… Thus, this does not do what the question asks for.Cellulitis
Just like what @EricLebigot commented. The method above does not do what the question asks for very well.Transilient
P
4

Instead of using a re module function re.split you can achieve the same result using the series.str.split method of pandas.

First, create a series with the above string and then apply the method to the series.

import pandas as pd

thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat=',|-')

parameter pat takes the delimiters and returns the split string as an array. Here the two delimiters are passed using a | (or operator). The output is as follows:

[Hey, you , what are you doing here!?]

Ptolemaic answered 10/9, 2018 at 15:32 Comment(1)
It's not a matter of verbosity but rather the fact of importing an entire library (which I love, BTW) to perform a simple task after converting a string to a pandas Series. Not very "Occam friendly".Phalansterian
O
3

Another way to achieve this is to use the Natural Language Tool Kit (nltk).

import nltk
data = "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print(word_tokens)

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.

Obreption answered 29/6, 2009 at 18:51 Comment(0)
E
3

I'm re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

tokens = [x.strip() for x in data.split(',')]
Ewe answered 20/4, 2012 at 16:53 Comment(1)
Clever, should work on all English grammatical constructs I can think of except an em-dash with no spaces—this, for example. (Workaroundable.)Sports
L
3

using maketrans and translate you can do it easily and neatly

specials = ',.!?:;"()<>[]#$=-/'
# str.maketrans is the Python 3 spelling; Python 2 used string.maketrans
trans = str.maketrans(specials, ' '*len(specials))
body = body.translate(trans)
words = body.strip().split()
Lynnett answered 3/3, 2018 at 23:59 Comment(1)
Great answer as for Python >= 3.6Manumission
B
2

First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.

I come across this pretty frequently, and my usual solution doesn't require re.

One-liner lambda function w/ list comprehension:

(requires import string):

split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']


Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.

General Function w/o Lambda or List Comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.
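
As a rough sketch of that generalization (the name split_stripping and the default argument are assumptions for illustration, not part of the code above), the comprehension can take the characters to strip as a parameter:

```python
import string


def split_stripping(text, ignore=string.punctuation):
    # Split on whitespace, strip the unwanted characters from both ends
    # of each piece, and drop pieces that end up empty.
    stripped = (word.strip(ignore) for word in text.split())
    return [word for word in stripped if word]


print(split_stripping("Hey, you -- what's a well-known trick?!"))
# ['Hey', 'you', "what's", 'a', 'well-known', 'trick']
print(split_stripping("..a.b..", ignore="."))
# ['a.b']
```

The second call shows the limitation of stripping: characters inside a word are never removed, only those at its ends.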

Berniecebernier answered 4/11, 2014 at 19:17 Comment(0)
M
1

I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:

str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()

If you don't want to split at spaces, substitute some other character in place of the space and split on that same character.
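
A minimal sketch of that variant, assuming a placeholder character that never occurs in the text (the name split_keep_spaces and the choice of '\0' are made up for illustration):

```python
def split_keep_spaces(text, splitat, placeholder='\0'):
    # Replace every delimiter with the placeholder, then split on it.
    # Assumes the placeholder never occurs in the text itself.
    marked = ''.join(placeholder if ch in splitat else ch for ch in text)
    # Drop the empty strings produced by adjacent delimiters.
    return [part for part in marked.split(placeholder) if part]


print(split_keep_spaces('first name:last name.zip code', ':.'))
# ['first name', 'last name', 'zip code']
```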

Matriarchy answered 15/3, 2011 at 10:12 Comment(1)
what if I have to split using word?Bentwood
H
1

First of all, when you run a regex operation in a loop, call re.compile() once beforehand and reuse the compiled pattern; this is typically somewhat faster than passing the pattern string on every iteration.

So for your problem, first compile the pattern and then call its methods.

import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")
print(reg_tok.findall(DATA))
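
For the loop case this advice is about, a sketch might look like the following. The sample lines are invented for illustration. Keep in mind that the re module also caches recently used patterns internally, so the saving is mostly the per-call cache lookup, plus having the pattern defined in one place.

```python
import re

# Compile once, outside the loop, and reuse the pattern object.
WORD = re.compile(r"[\w']+")

lines = [
    "Hey, you - what are you doing here!?",
    "Don't panic; it's fine.",
]

tokens_per_line = [WORD.findall(line) for line in lines]
for tokens in tokens_per_line:
    print(tokens)
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
# ["Don't", 'panic', "it's", 'fine']
```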
Housekeeper answered 2/6, 2015 at 7:6 Comment(0)
G
1

Here is the answer with some explanation.

st = "Hey, you - what are you doing here!?"

# replace all the non alpha-numeric with space and then join.
new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
# output of new_string
'Hey  you   what are you doing here  '

# str.split() will remove all the empty string if separator is not provided
new_list = new_string.split()

# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# we can join it to get a complete string without any non alpha-numeric character
' '.join(new_list)
# output
'Hey you what are you doing here'

or in one line, we can do like this:

(''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()

# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']


Gregoor answered 4/6, 2016 at 19:35 Comment(0)
A
1

Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else: 
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output
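
For comparison, the same "runs of non-delimiter characters" idea can be sketched with itertools.groupby from the standard library. This is a different technique from the flag-based loop above, and the name split_string_grouped is made up for illustration.

```python
from itertools import groupby


def split_string_grouped(source, splitlist):
    # groupby yields alternating runs of delimiter / non-delimiter
    # characters; keep only the non-delimiter runs.
    return [
        ''.join(run)
        for is_delimiter, run in groupby(source, key=lambda ch: ch in splitlist)
        if not is_delimiter
    ]


print(split_string_grouped("Hey, you - what are you doing here!?", " ,-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```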
Aenea answered 10/5, 2017 at 0:58 Comment(0)
R
1

I like pprzemek's solution because it does not assume that the delimiters are single characters and it doesn't try to leverage a regex (which would not work well if the number of separators got to be crazy long).

Here's a more readable version of the above solution for clarity:

def split_string_on_multiple_separators(input_string, separators):
    buffer = [input_string]
    for sep in separators:
        strings = buffer
        buffer = []  # reset the buffer
        for s in strings:
            buffer = buffer + s.split(sep)

    return buffer
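
For comparison only, multi-character separators can also be handled with a regex alternation built from escaped separators. This is a sketch, not a claim about which approach scales better; the name split_on_any is made up, and it assumes a non-empty list of non-empty separators.

```python
import re


def split_on_any(text, separators):
    # Escape each separator so regex metacharacters are taken literally,
    # and try longer separators first so '->' wins over '-'.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    # Unlike repeated str.split, this drops the empty strings that
    # appear between adjacent separators.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_any("a--b->c,d-e", ["-", "--", "->", ","]))
# ['a', 'b', 'c', 'd', 'e']
```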
Rife answered 23/5, 2019 at 17:3 Comment(0)
K
1

I had to come up with my own solution since everything I've tested so far failed at some point.

>>> import re
>>> def split_words(text):
...     rgx = re.compile(r"((?:(?<!'|\w)(?:\w-?'?)+(?<!-))|(?:(?<='|\w)(?:\w-?'?)+(?=')))")
...     return rgx.findall(text)

It seems to be working fine, at least for the examples below.

>>> split_words("The hill-tops gleam in morning's spring.")
['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
>>> split_words("I'd say it's James' 'time'.")
["I'd", 'say', "it's", "James'", 'time']
>>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
>>> split_words("google.com email@google.com split_words")
['google', 'com', 'email', 'google', 'com', 'split_words']
>>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
>>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']
Knowledge answered 23/5, 2019 at 23:6 Comment(0)
E
0

Here is my go at a split with multiple delimiters:

def msplit(text, delims):
    w = ''
    for z in text:
        if z not in delims:
            w += z
        else:
            if len(w) > 0:
                yield w
            w = ''
    if len(w) > 0:
        yield w
Eller answered 6/8, 2011 at 11:38 Comment(0)
H
0

I think the following is the best answer to suit your needs:

\W+ may be suitable for this case, but may not be suitable for other cases.

import re
list(filter(None, re.compile(r'[ |,|\-|!|?]').split("Hey, you - what are you doing here!?")))
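
A sketch of the same idea with a plain character class: inside [...] every character is already an alternative, so the | characters are not needed (they would be matched literally). The variable names here are illustrative.

```python
import re

DATA = "Hey, you - what are you doing here!?"

# Inside [...] every character is an alternative already, so no '|' is needed.
# The trailing '+' treats a run of delimiters as a single split point.
words = [w for w in re.split(r'[ ,\-!?]+', DATA) if w]
print(words)
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```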
Hinman answered 9/3, 2012 at 8:30 Comment(1)
I agree, the \w and \W solutions are not an answer to (the title of) the question. Note that in your answer, | should be removed (you're thinking of expr0|expr1 instead of [char0 char1…]). Furthermore, there is no need to compile() the regular expression.Cellulitis
T
0

Here's my take on it:

def split_string(source,splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']
Tandratandy answered 29/4, 2013 at 5:32 Comment(0)
L
0
def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            if w != '': 
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '': 
        l.append(w)
    return l

Here is the usage:

>>> s = "Hey, you - what are you doing here!?"
>>> print(get_words(s))
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
Letter answered 24/12, 2013 at 2:17 Comment(0)
F
0

I like the replace() way the best. The following procedure changes all separators defined in a string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist is an empty string. It returns a list of words, with no empty strings in it.

def split_string(text, splitlist):
    for sep in splitlist:
        text = text.replace(sep, splitlist[0])
    return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]
Freitas answered 7/2, 2014 at 23:15 Comment(0)
L
0

If you want a reversible operation (preserve the delimiters), you can use this function:

def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]

    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for token in listOfTokens:
            # Re-insert the delimiter before every piece except the first
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            pieces = [item for sublist in ll for item in sublist]  # flattens
            newListOfTokens.extend(filter(None, pieces))  # drops empty tokens: ''

        listOfTokens = newListOfTokens

    return listOfTokens
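
For comparison, re.split with a capturing group also keeps the delimiters in the result. This is a sketch of an alternative technique, not the function above; the name tokenize_reversible is made up, and it assumes a non-empty string of single-character delimiters.

```python
import re


def tokenize_reversible(sentence, delimiters='. ,*;!'):
    # A capturing group makes re.split return the delimiters as well.
    pattern = '([' + re.escape(delimiters) + '])'
    # Empty strings appear between adjacent delimiters; dropping them
    # does not affect the round trip.
    return [token for token in re.split(pattern, sentence) if token]


tokens = tokenize_reversible("Hey, you; stop!")
print(tokens)
# ['Hey', ',', ' ', 'you', ';', ' ', 'stop', '!']
print(''.join(tokens) == "Hey, you; stop!")
# True
```

Joining the tokens reproduces the original sentence, which is what makes the operation reversible.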
Loesceke answered 22/1, 2018 at 8:25 Comment(0)
L
0

I recently needed to do this but wanted a function that somewhat matched the standard library str.split function. This function behaves the same as the standard library's when called with 0 or 1 separators.

def split_many(string, *separators):
    if len(separators) == 0:
        return string.split()
    if len(separators) > 1:
        # Map every separator to the first one, then split on that
        table = {
            ord(separator): ord(separators[0])
            for separator in separators
        }
        string = string.translate(table)
    return string.split(separators[0])

NOTE: This function is only useful when your separators consist of a single character (as was my use case).

Linguiform answered 17/5, 2019 at 8:22 Comment(0)
