Split string with multiple delimiters in Python [duplicate]

5

788

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ';' or ', '. That is, the delimiter has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]' , 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]') 
Indoctrinate answered 14/2, 2011 at 23:42 Comment(0)
T
1285

Luckily, Python has this built-in :)

import re

# Regex pattern splits on substrings "; " and ", "
re.split('; |, ', string_to_split)

Update:

Following your comment:

>>> string_to_split = 'Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', string_to_split)
['Beautiful', 'is', 'better', 'than', 'ugly']
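
Applied to the string from the question, a minimal sketch might look like this. The `\s*` after the semicolon is an assumption on my part: the sample has a space after the ';' that the expected output does not keep, so the pattern swallows it.

```python
import re

string_to_split = (
    "b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
    "mesitylene [000108-67-8]; "
    "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
)

# Split on ';' plus any trailing whitespace, or on ',' followed by one
# whitespace character. The commas inside "1,2-dihydro-2,2,4-" are followed
# by digits, so they are left untouched.
parts = re.split(r';\s*|,\s', string_to_split)

for part in parts:
    print(part)
```

If a plain space is a strict requirement, `', '` can be used in place of `',\s'`.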
Timpani answered 14/2, 2011 at 23:52 Comment(15)
@Paul There isn't. You aren't understanding regex properly if you think there is. See my comment on your post below.Obala
I'd prefer to write it as: re.split(r';|,\s', a) by replacing ' ' (space character) with '\s' (white space) unless space character is a strict requirement.Adenectomy
I wonder why (regular) split just can't accept a list, that seems like a more obvious way instead of encoding multiple options in a line.Incendiary
It's a regex feature, not a Python one. :) And Python's regex is a little lame (but usually good enough). But that's why we have the regex module.Lorrimor
is it possible to return the delimiters in array too? example ['Beautiful', ',' 'is' , ';' , 'better', '*' ,' than', '\n' ,'ugly']Rant
It is worth noting that this uses regular expressions, as mentioned above. So trying to split a string with . will split on every single character. You need to escape it: \.Multiparous
Just to add to this a little bit, instead of adding a bunch of or "|" symbols you can do the following: re.split('[;,.\-\%]',str), where inside of [ ] you put all the characters you want to split by.Follansbee
Is there a way to know which delimiter is actually used for a specific split? In the above example, than and ugly are split by '\n' and better and than are split by '*'.Frayne
@jmracek: Thanks for this comment, I had to use it to make my split workBerri
Is there a way to retain the delimiters in the output but combine them together? I know that doing re.split('(; |, |\*|\n)', a) will retain the delimiters, but how can I combine subsequent delimiters into one element in the output list?Gasiform
@Follansbee That's worthy of a standalone answerDianoia
Luckily you say ;)Longsufferance
For people not very familiar with regex, note that there are some common separators that have to be escaped - for example . and ? need to be escaped as \. and \?. More info can be found here: riptutorial.com/regex/example/15848/…Raul
@jonathan-livni how do I do this to user input strings such as list(map(int, input().split())) ?Mardellmarden
I upvoted this answer but don't like the need to list all possibilities of spaces like ",\s", ",\s\s", "\s,\s" and so on. That's why I prefer to split on each character, then throw away empty slices. [s for s in re.split(r'[;,\*\n\s]', a) if s]Revamp
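
Several comments above ask how to keep the delimiters in the result, or how to tell which one was used. A rough sketch, assuming Python 3: wrapping the pattern in a capturing group makes re.split return the matched delimiters too, and a repeated non-capturing group inside it merges consecutive delimiters into one element. The second sample string is made up for illustration.

```python
import re

a = 'Beautiful, is; better*than\nugly'

# A capturing group makes re.split return each matched delimiter
# between the pieces it separates.
with_delims = re.split(r'(; |, |\*|\n)', a)
print(with_delims)

# Repeating a non-capturing group inside the capture merges
# back-to-back delimiters into a single list element.
b = 'one, ; two*\nthree'
merged = re.split(r'((?:; |, |\*|\n)+)', b)
print(merged)
```

The delimiter at an odd index is the one that separated its two neighbours, which answers the "which delimiter was used" question.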
M
503

Do a str.replace('; ', ', ') and then a str.split(', ')

Maltreat answered 14/2, 2011 at 23:47 Comment(8)
Suppose you have 5 delimiters; you have to traverse your string 5 times.Ludewig
that is very bad for performanceStanhope
This shows a different vision of yours toward this problem. I think it is a great one. "If you don't know a direct answer, use combination of things you know to solve it".Terrell
If you have a small number of delimiters and are performance-constrained, the replace trick is the fastest of all. 15x faster than regexp, and almost 2x faster than a nested for in val.split(...) generator.Tobytobye
what if an array has empty slots? ['6', '1862', '5', '1863', '222', '', '', '', '', '']Chil
@JuneWang one method might be to loop through the elements of the array and upon finding an empty element or an element which you desire to remove, remove that from the array by using array.remove(element)Dobsonfly
Of course you will get better performance by using re.split() if you have multiple separator characters, but this is a very smart and easy-to-understand way of solving the problem.Wingard
Performance is not always a concern. My use case was to process input from a human-entered command line argument so this solution was quite ideal. I also try to avoid regex whenever possible. Easy to create, very difficult to read.Milldam
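
For more than two delimiters, the replace-then-split idea could be generalised roughly as below. The helper name split_by_replace is hypothetical, and the sketch assumes no delimiter is a substring of another. It also drops the empty strings asked about above by filtering the result instead of calling remove() while looping.

```python
def split_by_replace(text, delimiters):
    """Split text on any of the delimiters using only str methods."""
    first, *rest = delimiters
    for delim in rest:
        # One pass over the string per extra delimiter.
        text = text.replace(delim, first)
    # Drop the empty strings left by adjacent or trailing delimiters.
    return [piece for piece in text.split(first) if piece]

result = split_by_replace('6; 1862, 5, , ; ', ['; ', ', '])
print(result)
```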
T
191

Here's a safe way for any iterable of delimiters, using regular expressions:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join(map(re.escape, delimiters))
>>> regex_pattern
'a|\\.\\.\\.|\\(c\\)'
>>> re.split(regex_pattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]

re.escape lets you build the pattern automatically, with every delimiter escaped so that regex metacharacters are matched literally.

Here's this solution as a function for your copy-pasting pleasure:

import re

def split(delimiters, string, maxsplit=0):
    # Escape each delimiter so regex metacharacters are matched literally
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string, maxsplit=maxsplit)

If you're going to split often using the same delimiters, compile your regular expression beforehand with re.compile and use the compiled pattern's split method (RegexObject.split).
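
A sketch of the precompiled variant, assuming Python 3. The helper name make_splitter is hypothetical. It also sorts the delimiters longest first, because alternation tries the options in order, so a short delimiter listed before a longer one that starts with it would always win.

```python
import re

def make_splitter(delimiters):
    # Longest first, so that "--" is tried before "-".
    ordered = sorted(delimiters, key=len, reverse=True)
    return re.compile('|'.join(map(re.escape, ordered)))

splitter = make_splitter(['-', '--'])
parts = splitter.split('a--b-c')
print(parts)

# Without the ordering, '-' matches twice in a row and leaves an empty string.
naive = re.split('-|--', 'a--b-c')
print(naive)
```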


If you'd like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead (this relies on re.split splitting on empty matches, which it does from Python 3.7 onward):

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regex_pattern
'(?<=a)|(?<=\\.\\.\\.)|(?<=\\(c\\))'
>>> re.split(regex_pattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]

(replace ?<= with ?= to attach the delimiters to the righthand side, instead of left)
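
For completeness, a sketch of that lookahead variant on the same example, again assuming Python 3.7 or later so that re.split acts on empty matches:

```python
import re

delimiters = "a", "...", "(c)"
example = "stackoverflow (c) is awesome... isn't it?"

# (?=...) matches the empty position just before each delimiter,
# so the delimiter stays at the start of the piece that follows it.
regex_pattern = '|'.join('(?={})'.format(re.escape(delim)) for delim in delimiters)
pieces = re.split(regex_pattern, example)
print(pieces)
```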

Tsarism answered 1/11, 2012 at 20:15 Comment(0)
I
97

In response to Jonathan's answer above, this only seems to work for certain delimiters. For example:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']

>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']

Putting the delimiters in square brackets (a character class) makes the regex split on any one of them:

>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']
Incongruous answered 9/1, 2013 at 10:22 Comment(2)
It works for all the delimiters you specify. A regex of - : matches exactly - : and thus won't split the date/time string. A regex of [- :] matches -, <space>, or : and thus splits the date/time string. If you want to split only on - and : then your regex should be either [-:] or -|:, and if you want to split on -, <space> and : then your regex should be either [- :] or -| |:.Obala
@Obala I see my mistake: I missed the fact that your regex contains the OR |. I blindly identified it as a desired separator.Incongruous
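
To make the distinction from the comments concrete, here is a small sketch comparing the literal sequence, the character class and the equivalent alternation on the same date/time string:

```python
import re

b = '1999-05-03 10:37:00'

# '- :' is the literal three-character sequence, which never occurs in b.
literal = re.split('- :', b)

# A character class and an alternation of single characters are equivalent.
char_class = re.split('[- :]', b)
alternation = re.split('-| |:', b)

# Leaving the space out of the class keeps date and time glued together.
no_space = re.split('[-:]', b)

print(literal, char_class, alternation, no_space, sep='\n')
```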
E
40

This is what the regex looks like:

import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")

# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")

# text is the string to split
print(pattern.split(text))
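
The two patterns behave differently, which a short Python 3 sketch on a made-up sample string (the value of text here is an assumption for illustration) can show:

```python
import re

text = "a;b, c; d,e"

# "semicolon or (a comma followed by a space)"
semicolon_or_comma_space = re.compile(r";|, ")
first = semicolon_or_comma_space.split(text)

# "(semicolon or a comma) followed by a space"
either_then_space = re.compile(r"[;,] ")
second = either_then_space.split(text)

print(first)
print(second)
```

The first pattern splits on every semicolon, with or without a trailing space, so a space following a semicolon stays at the start of the next piece. The second one only splits where a space follows.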
Ebeneser answered 14/2, 2011 at 23:52 Comment(0)
