Substitute multiple whitespace with single whitespace in Python [duplicate]

Asked 16/1, 2010 at 15:43 Answered 10/4, 2013 at 22:23

Solved python substitution removing-whitespace

494

I have this string:

mystring = 'Here is  some   text   I      wrote   '

How can I substitute the double, triple (...) whitespace characters with a single space, so that I get:

mystring = 'Here is some text I wrote'

Mailand answered 16/1, 2010 at 15:43 Comment(1)

You should probably say 'substitute multiple whitespace with a single space' since whitespace is a class of characters (tabs, newlines etc.) – Ligroin 16/1, 2010 at 16:15

1032

A simple possibility (if you'd rather avoid REs) is

' '.join(mystring.split())

The split and join perform the task you're explicitly asking about -- plus, they also do the extra one that you don't talk about but is seen in your example, removing trailing spaces;-).
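A quick check of that behaviour, as a minimal sketch. The second sample string (with a tab and a newline) is made up for illustration and is not from the question:

```python
# Collapse every run of whitespace into one space using split/join.
mystring = 'Here is  some   text   I      wrote   '
collapsed = ' '.join(mystring.split())
print(repr(collapsed))  # 'Here is some text I wrote'

# Hypothetical sample: tabs and newlines count as whitespace too,
# so each run of them is replaced by a single space as well.
mixed = ' tab\there \n new line  '
collapsed_mixed = ' '.join(mixed.split())
print(repr(collapsed_mixed))  # 'tab here new line'
```

Note that leading and trailing whitespace disappears in both cases, because split() with no arguments never yields empty strings at the ends.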

Pilchard answered 16/1, 2010 at 15:54 Comment(8)

Oh cool, I was fumbling with a similar solution, but using split(' ') and then a filter to remove empty elements. I never knew split with no arguments worked like this. This is also much faster, timeit.py gives me around 0.74usec for this, versus 5.75usec for regular expressions. – Trinatrinal 16/1, 2010 at 16:0

@Roman, yes, x.split() (and x.split(None)) splits on sequences of whitespace (including tabs, newlines, etc, like re's \s) of length 1+ -- and it's pretty fast indeed. So, always glad to help! – Pilchard 16/1, 2010 at 16:25

this is a very elegant solution, but I want to mention that this will also remove any linebreaks as well – Beabeach 24/8, 2015 at 0:26

str.split also considers various characters (\x0b, \x0c, \x1c, \x1d, \x1e, \x1f) to be whitespace, and sometimes this is not intended. – Tinware 10/2, 2020 at 1:30

Cleanest solution by far, and according to my tests it is slightly faster than the regex approach. It doesn't fit some specific situations like those in the comments above, but you don't need to import a module to do the job, which is probably one of the reasons it is "slightly" faster (by 3 to 5 ms). – Woodcraft 10/4, 2020 at 15:53

To avoid '\n' from being mixed with ' ' one can use splitlines() like this: ' '.join((''.join(text.splitlines())).split()) – Joung 25/8, 2020 at 17:28

To only strip consecutive repeated spaces one can use ' '.join(filter(None, mystring.split(' '))). The filter drops the empty strings that split(' ') produces between adjacent spaces. This will also remove the leading and trailing spaces but will keep newlines, tabs, etc. – Doubtless 11/6, 2022 at 12:5

Does split() match the same white space characters as \s? – Crescendo 18/1, 2023 at 15:37

197

A regular expression can be used to offer more control over the whitespace characters that are combined.

To match unicode whitespace:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()

To match ASCII whitespace only:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)

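Matching only ASCII whitespace is sometimes essential for keeping control characters such as \x1c, \x1d, \x1e, \x1f, which the Unicode \s matches. Note that \x0b (\v) and \x0c (\f) are part of [ \t\n\r\f\v], so they are still matched in ASCII mode.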

Reference:

About \s:

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

About re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

strip() will remove any leading and trailing whitespace.
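To illustrate the difference between the two modes, here is a small sketch. The sample string is hypothetical, and re.ASCII is used as the flag form of the inline (?a) shown above:

```python
import re

# Unicode-aware pattern (the default for str) and an ASCII-only pattern.
unicode_ws = re.compile(r"\s+")
ascii_ws = re.compile(r"\s+", re.ASCII)

# Hypothetical sample: non-breaking spaces (\xa0) and file-separator
# control characters (\x1c) between the words, plus two plain spaces.
sample = "a\xa0\xa0b\x1c\x1cc  d"

unicode_result = unicode_ws.sub(" ", sample)
ascii_result = ascii_ws.sub(" ", sample)
split_result = " ".join(sample.split())

print(repr(unicode_result))  # 'a b c d'
print(repr(ascii_result))    # 'a\xa0\xa0b\x1c\x1cc d'
print(repr(split_result))    # 'a b c d'
```

For this sample, str.split() behaves like the Unicode pattern, while the ASCII pattern only collapses the two plain spaces.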

Invigilate answered 16/1, 2010 at 15:46 Comment(1)

If you really only want to replace spaces (' '), use re.sub(' +', ' ', mystring).strip() – Drooff 16/7, 2018 at 13:14

For completeness, you can also use:

# The while loop alone would leave a single leading/trailing space,
# so strip the string before or after the loop.
mystring = mystring.strip()
while '  ' in mystring:
    mystring = mystring.replace('  ', ' ')

which will work quickly on strings with relatively few spaces (faster than re in these situations).

In any scenario, Alex Martelli's split/join solution performs at least as quickly (usually significantly more so).

In your example, using the default values of timeit.Timer.repeat(), I get the following times:

str.replace: [1.4317800167340238, 1.4174888149192384, 1.4163512401715934]
re.sub:      [3.741931446594549,  3.8389395858970374, 3.973777672860706]
split/join:  [0.6530919432498195, 0.6252146571700905, 0.6346594329726258]
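A rough sketch of how such a comparison might be reproduced. The function names, the precompiled pattern, and the reduced repetition count are assumptions made for illustration, not the exact setup used for the numbers above, and the timings will differ from machine to machine:

```python
import re
import timeit

mystring = 'Here is  some   text   I      wrote   '
_ws = re.compile(r'\s+')  # assumption: pattern compiled once up front

def with_replace(s):
    # Strip first, then halve runs of spaces until none remain.
    s = s.strip()
    while '  ' in s:
        s = s.replace('  ', ' ')
    return s

def with_re(s):
    return _ws.sub(' ', s).strip()

def with_split_join(s):
    return ' '.join(s.split())

candidates = (with_replace, with_re, with_split_join)

# Sanity check: all three approaches agree on the example string.
outputs = {func.__name__: func(mystring) for func in candidates}

for func in candidates:
    # number is kept small here so the sketch finishes quickly.
    times = timeit.repeat(lambda: func(mystring), number=20_000, repeat=3)
    print(f'{func.__name__}: {times}')
```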

EDIT:

Just came across this post which provides a rather long comparison of the speeds of these methods.

Shaduf answered 10/4, 2013 at 22:23 Comment(2)

More lines than the others, and thus less "pythonic", but clearer. – Hussar 11/2, 2016 at 20:2

A reminder, this one has the risk of being infinite loop if you typo. – Spallation 24/6, 2020 at 14:23
