Substitute multiple whitespace with single whitespace in Python [duplicate]
M

3

494

I have this string:

mystring = 'Here is  some   text   I      wrote   '

How can I substitute the double, triple (...) whitespace characters with a single space, so that I get:

mystring = 'Here is some text I wrote'
Mailand answered 16/1, 2010 at 15:43 Comment(1)
You should probably say 'substitute multiple whitespace with a single space' since whitespace is a class of characters (tabs, newlines etc.)Ligroin
P
1032

A simple possibility (if you'd rather avoid REs) is

' '.join(mystring.split())

The split and join perform the task you're explicitly asking about -- plus, they also do the extra one that you don't talk about but is seen in your example, removing trailing spaces;-).
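A minimal sketch of what the one-liner does, wrapped in a helper whose name (collapse_whitespace) is only illustrative. It shows that tabs and newlines are treated like spaces and that both ends are trimmed.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # split() with no argument splits on runs of whitespace and yields no
    # empty strings; join() then glues the words back with single spaces.
    return ' '.join(text.split())


mystring = 'Here is  some   text   I      wrote   '
print(repr(collapse_whitespace(mystring)))

# Tabs and newlines count as whitespace too, so they are replaced as well.
print(repr(collapse_whitespace('tabs\tand\nnewlines  too')))
```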

Pilchard answered 16/1, 2010 at 15:54 Comment(8)
Oh cool, I was fumbling with a similar solution, but using split(' ') and then a filter to remove empty elements. I never knew split with no arguments worked like this. This is also much faster, timeit.py gives me around 0.74usec for this, versus 5.75usec for regular expressions.Trinatrinal
@Roman, yes, x.split() (and x.split(None)) splits on sequences of whitespace (including tabs, newlines, etc, like re's \s) of length 1+ -- and it's pretty fast indeed. So, always glad to help!Pilchard
this is a very elegant solution, but I want to mention that this will also remove any linebreaks as wellBeabeach
str.split also considers various characters (x0b, x0c, x1c, x1d, x1e, x1f) to be whitespace, and sometimes this is not intended.Tinware
Cleanest solution by far, and according to my tests it is slightly faster than the regex approach (by about 3 to 5 ms). It doesn't fit some specific situations, like those in the comments above, but you don't need to import a module to do the job, which is probably one of the reasons it is "slightly" faster.Woodcraft
To avoid '\n' from being mixed with ' ' one can use splitlines() like this: ' '.join((''.join(text.splitlines())).split())Joung
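To only strip consecutive repeated spaces one can use ' '.join(part for part in mystring.split(' ') if part); a plain ' '.join(mystring.split(' ')) just rebuilds the original string, because split(' ') keeps the empty strings between adjacent spaces. This will also remove the leading and trailing spaces but will keep newlines, tabs, etc.Doubtless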
Does split() match the same white space characters as \s?Crescendo
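For the case raised in the comments where line breaks should survive, one possible approach is to normalize each line separately. This is only a sketch; the helper name collapse_within_lines is made up for illustration, and it assumes that joining the lines back with '\n' is acceptable.

```python
def collapse_within_lines(text):
    """Collapse whitespace inside each line but keep the line breaks."""
    # splitlines() drops the line terminators (and also splits on a few
    # less common separators such as \x0b and \x0c), so the lines are
    # re-joined with '\n' and a trailing newline is not preserved.
    return '\n'.join(' '.join(line.split()) for line in text.splitlines())


sample = 'first   line\n   second\t\tline  '
print(repr(collapse_within_lines(sample)))
```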
I
197

A regular expression can be used to offer more control over the whitespace characters that are combined.

To match Unicode whitespace:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()

To match ASCII whitespace only:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)

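Matching only ASCII whitespace is sometimes essential for keeping control characters such as \x1c, \x1d, \x1e, \x1f, which Unicode matching treats as whitespace. Note that \x0b and \x0c (\v and \f) are matched by \s in both modes.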
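A small check of how the two modes differ, using the re.ASCII flag rather than the inline (?a:...) form. The sample string and the constant names are made up for illustration; it contains a \x1c control character and a non-breaking space (\u00a0).

```python
import re

UNICODE_WS = re.compile(r"\s+")          # default for str patterns
ASCII_WS = re.compile(r"\s+", re.ASCII)  # only [ \t\n\r\f\v]

sample = "a\x1cb\u00a0c  d"

# Unicode mode replaces \x1c and the non-breaking space as well.
print(repr(UNICODE_WS.sub(" ", sample)))

# ASCII mode leaves both untouched and only collapses the two spaces.
print(repr(ASCII_WS.sub(" ", sample)))
```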

Reference:

About \s:

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

About re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

strip() will remove any leading and trailing whitespace.

Invigilate answered 16/1, 2010 at 15:46 Comment(1)
If you really only want to replace spaces (' '), use re.sub(' +', ' ', mystring).strip()Drooff
S
48

For completeness, you can also use:

# The while loop alone would leave a single leading or trailing space,
# so the ends must be stripped before or after the loop.
mystring = mystring.strip()
while '  ' in mystring:
    mystring = mystring.replace('  ', ' ')

which will work quickly on strings with relatively few spaces (faster than re in these situations).

In any scenario, Alex Martelli's split/join solution performs at least as quickly (usually significantly more so).

In your example, using the default values of timeit.Timer.repeat(), I get the following times (in seconds, for 1,000,000 executions per run):

str.replace: [1.4317800167340238, 1.4174888149192384, 1.4163512401715934]
re.sub:      [3.741931446594549,  3.8389395858970374, 3.973777672860706]
split/join:  [0.6530919432498195, 0.6252146571700905, 0.6346594329726258]
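A rough way to reproduce a comparison like this on your own machine. The function names are illustrative, and the run count is lowered to 100,000 per repeat (an assumption, not the setting used for the figures above), so absolute numbers will differ.

```python
import re
import timeit

mystring = 'Here is  some   text   I      wrote   '
_RE_WS = re.compile(r'\s+')


def with_replace(text):
    # Strip first, then halve runs of spaces until none remain.
    text = text.strip()
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text


def with_re(text):
    return _RE_WS.sub(' ', text).strip()


def with_split_join(text):
    return ' '.join(text.split())


if __name__ == '__main__':
    for func in (with_replace, with_re, with_split_join):
        # The lambda adds a small constant overhead to every variant.
        times = timeit.repeat(lambda: func(mystring), number=100_000, repeat=3)
        print(f'{func.__name__:16}', times)
```

Note that with_replace only touches plain spaces, so the three variants agree on this input but not on text containing tabs or newlines.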


EDIT:

Just came across this post which provides a rather long comparison of the speeds of these methods.

Shaduf answered 10/4, 2013 at 22:23 Comment(2)
More lines than the others, and thus less "pythonic", but clearer.Hussar
A reminder: this one risks becoming an infinite loop if you make a typo (for example, replacing two spaces with two spaces).Spallation
