How to properly split this list of strings?

Asked 19/2, 2017 at 17:59 Answered 19/2, 2017 at 21:4

I have a list of strings such as this:

['z+2-44', '4+55+z+88']

How can I split the strings in this list so that the result looks something like this?

[['z','+','2','-','44'],['4','+','55','+','z','+','88']]

I have already tried the split method, but that splits the 44 into 4 and 4, and I am not sure what else to try.

Shere answered 19/2, 2017 at 17:59 Comment(6)

The specification is incomplete I guess. What about math operators * and /? What about variables a, b, and c? Is pi a constant, a variable or p*i? The question as given will attract answers that might not really be helpful for all your cases. – Avie 19/2, 2017 at 19:20

@martineau I believe that this question is not a proper duplicate. – Changsha 19/2, 2017 at 21:8

@Kasramvd: I'd be interested in hearing why you think that. – Internalcombustion 19/2, 2017 at 21:33

@Internalcombustion Because answering this question doesn't necessarily require any knowledge of regex. It's also not only about string processing; it's a list containing strings, as you can see in my answer. I mentioned the proper usage of regex as well. – Changsha 19/2, 2017 at 21:43

@Kasramvd: While it's certainly possible to solve the problem without using regular expressions, it's really a poor way to do it (and possibly an excuse to not learn how to use regular expressions if one doesn't know already). However, if you feel strongly that the question being marked as a duplicate was wrong, feel free to reopen it yourself (or at least vote to reopen it). – Internalcombustion 19/2, 2017 at 21:55

@Internalcombustion I think there is another similar question, #18464888, but this question is simpler and can be solved in simpler ways too. Anyway, I also updated my answer with another way using the tokenize module. – Changsha 19/2, 2017 at 23:33

You can use regex:

import re
lst = ['z+2-44', '4+55+z+88']
[re.findall(r'\w+|\W+', s) for s in lst]
# [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

\w+|\W+ matches either a run of word characters (the letters and digits in your case) or a run of non-word characters (the + and - signs in your case).
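A minimal sketch of how the two alternatives behave, wrapped in a helper whose name (split_tokens) is only illustrative. It also shows one caveat worth knowing: because \W+ matches a whole run of non-word characters, whitespace around an operator ends up in the same token.

```python
import re

def split_tokens(expr):
    # \w+ matches a run of letters, digits or underscores;
    # \W+ matches a run of anything else (operators, spaces, ...).
    return re.findall(r'\w+|\W+', expr)

plain = split_tokens('z+2-44')
spaced = split_tokens('z + 2')

print(plain)   # ['z', '+', '2', '-', '44']
print(spaced)  # ['z', ' + ', '2']  (the spaces stay attached to the operator)
```

If the input may contain spaces, stripping them first (for example with expr.replace(' ', '')) is one simple way around this.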

Collegian answered 19/2, 2017 at 18:3 Comment(0)

This also works, using itertools.groupby:

import itertools

z = ['z+2-44', '4+55+z+88']

print([["".join(x) for k,x in itertools.groupby(i,str.isalnum)] for i in z])

output:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

groupby splits each string into consecutive runs of characters that are alphanumeric and runs that are not; each run is then joined back into a string inside the list comprehension.
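To see what groupby actually hands back, here is a small sketch that keeps the key next to each joined run (the variable names are just for illustration):

```python
import itertools

s = 'z+2-44'

# Each item from groupby is (key, iterator over the consecutive characters
# sharing that key); the key here is the result of str.isalnum.
runs = [(is_alnum, ''.join(chars))
        for is_alnum, chars in itertools.groupby(s, str.isalnum)]

print(runs)
# [(True, 'z'), (False, '+'), (True, '2'), (False, '-'), (True, '44')]
```

Dropping the key and keeping only the joined strings gives the token list shown above.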

EDIT: the general case of a calculator with parentheses has been asked as a follow-up question here. If z is as follows:

z = ['z+2-44', '4+55+((z+88))']

then with the previous grouping we get:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+((', 'z', '+', '88', '))']]

This is not easy to parse as tokens, because consecutive operators and parentheses are glued together. One fix is to join only the alphanumeric runs and leave the other runs as individual characters, flattening everything at the end with chain.from_iterable:

print([list(itertools.chain.from_iterable(["".join(x)] if k else x for k,x in itertools.groupby(i,str.isalnum))) for i in z])

which yields:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']]

(Note that the alternate regex answer can be adapted in the same way: [re.findall(r'\w+|\W', s) for s in lst]. Note the lack of + after \W.)

Also, "".join(list(x)) is slightly faster than "".join(x), but I'll leave adding it to you, to avoid making that already complex expression harder to read.
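If the one-liner is hard to follow, the same idea can be sketched as an explicit loop. The function name tokenize_expr is made up for this example; the behaviour is meant to mirror the chain.from_iterable expression above.

```python
import itertools

def tokenize_expr(expr):
    tokens = []
    for is_alnum, chars in itertools.groupby(expr, str.isalnum):
        if is_alnum:
            # A run of letters/digits becomes one token, e.g. '44'.
            tokens.append(''.join(chars))
        else:
            # Every other character is its own token, e.g. '(' '(' '+'.
            tokens.extend(chars)
    return tokens

z = ['z+2-44', '4+55+((z+88))']
result = [tokenize_expr(s) for s in z]
print(result)
# [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']]
```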

Beaux answered 19/2, 2017 at 18:5 Comment(2)

You beat me by 3 secs :P – Earthstar 19/2, 2017 at 18:5

you're not a sore loser as I see :) thanks for the edit – Munich 19/2, 2017 at 18:6

An alternative solution using the re.split function:

import re

l = ['z+2-44', '4+55+z+88']
print([list(filter(None, re.split(r'(\w+)', i))) for i in l])

The output:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
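A short sketch of why this works and why the filter is there. When the pattern passed to re.split contains a capturing group, the matched text is kept in the result, so the \w+ runs show up alongside the operators between them. The split also leaves empty strings at the ends when the string starts or ends with a match, which is what filter(None, ...) removes.

```python
import re

raw = re.split(r'(\w+)', 'z+2-44')
print(raw)
# ['', 'z', '+', '2', '-', '44', '']

# filter(None, ...) drops the falsy items, here the empty strings.
cleaned = list(filter(None, raw))
print(cleaned)
# ['z', '+', '2', '-', '44']
```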

Briefless answered 19/2, 2017 at 18:14 Comment(0)

You could use only the built-in str.replace() and str.split() methods within a list comprehension:

In [34]: lst = ['z+2-44', '4+55+z+88']

In [35]: [s.replace('+', ' + ').replace('-', ' - ').split() for s in lst]
Out[35]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

But note that this is not an efficient approach for longer strings, since every replace() call builds a new copy of the string. In that case the best way to go is regex.

As another Pythonic option, you can also use the tokenize module:

In [56]: from io import StringIO

In [57]: import tokenize

In [59]: [[t.string for t in tokenize.generate_tokens(StringIO(i).readline) if t.string] for i in lst]
Out[59]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers,” including colorizers for on-screen displays.
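A hedged sketch of the same idea as a function, filtering by token type rather than by position. The helper name python_tokens and the SKIP set are assumptions made for this example. Filtering by type matters because newer Python versions emit an implicit NEWLINE token before ENDMARKER, so slicing off only the last token is not enough there.

```python
import io
import tokenize

# Bookkeeping tokens that carry no expression content.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def python_tokens(expr):
    readline = io.StringIO(expr).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in SKIP]

simple = python_tokens('z+2-44')
nested = python_tokens('4+55+((z+88))')
print(simple)  # ['z', '+', '2', '-', '44']
print(nested)  # ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']
```

Keep in mind that this applies Python's own lexical rules, so input that is not valid Python at the token level (an unterminated string, for instance) will raise an error instead of being split.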

Changsha answered 19/2, 2017 at 21:4 Comment(0)

If you want to stick with split (hence avoiding regex), you can provide it with an optional character to split on:

>>> testing = 'z+2-44'
>>> testing.split('+')
['z', '2-44']
>>> testing.split('-')
['z+2', '44']

So, you could whip something up by chaining the split commands.

However, using regular expressions would probably be more readable:

>>> import re
>>> re.split(r'\+|\-', testing)
['z', '2', '44']

This is just saying to "split the string at any + or - character". The backslashes are escape characters: + is a quantifier in a regex and must be escaped, while escaping - is harmless but only strictly needed inside a character class.

Lastly, in this particular case, I imagine the goal is something along the lines of "split at every non-alphanumeric character", in which case regex can still save the day:

>>> re.split('[^a-zA-Z0-9]', testing)
['z', '2', '44']
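As written, these patterns discard the operators. If they need to stay in the result, one possible adaptation (a sketch, not part of the original answer) is to wrap the character class in a capturing group, since re.split then returns the delimiters too. The helper name split_keep_delims is hypothetical.

```python
import re

def split_keep_delims(expr):
    # The parentheses around the character class make re.split keep each
    # delimiter; empty strings appear between adjacent delimiters or at the
    # ends, so they are filtered out.
    return [part for part in re.split(r'([^a-zA-Z0-9])', expr) if part]

testing = 'z+2-44'
kept = split_keep_delims(testing)
print(kept)  # ['z', '+', '2', '-', '44']
print(split_keep_delims('4+55+((z+88))'))
# ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']
```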

It is of course worth noting that there are a million other solutions, as discussed in some other SO discussions.

Python: Split string with multiple delimiters

Split Strings with Multiple Delimiters?

My answers here are targeted towards simple, readable code and not performance, in honor of Donald Knuth

Robbert answered 19/2, 2017 at 18:19 Comment(2)

Asker wants those signs to be in resulted list as well. not just z 2 44. – Nonscheduled 19/2, 2017 at 18:22

Ah yes, should have read the question better. I would update the answer but I see it has already been answered at this point. Carry on! – Robbert 19/2, 2017 at 18:26
