
I sometimes want to match whitespace but not newline.

So far I've been resorting to [ \t]. Is there a less awkward way?

Dani answered 12/8, 2010 at 15:0 Comment(3)

BTW, these characters are also "whitespace": [\r\f]. – Mezzotint 12/8, 2010 at 15:12

@eugeney is anyone still doing form feeds? (\f's) – Pantelleria 21/11, 2011 at 0:37

@AranMulholland: Anyone who has a character-oriented printer. Most printers have a character mode as well as PostScript or whatever the Hewlett Packard interface is called, and to throw a page you send a form feed. – Starknaked 6/7, 2016 at 11:23


Perl versions 5.10 and later support subsidiary vertical and horizontal character classes, \v and \h, as well as the generic whitespace character class \s.

The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters:

U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (not matched by \s)

U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

The vertical space pattern \v is less useful, but matches these characters:

U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (not matched by \s)

U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR

There are seven vertical whitespace characters which match \v and eighteen horizontal ones which match \h. \s matches twenty-three characters.

All whitespace characters are either vertical or horizontal with no overlap, but \h and \v are not proper subsets of \s, because \h also matches U+00A0 NO-BREAK SPACE and \v also matches U+0085 NEXT LINE, neither of which is matched by \s here (perlrecharclass notes that these two may or may not match \s depending on the rules in effect).

Starknaked answered 21/9, 2014 at 7:36 Comment(16)

\h works only in languages that support PCRE. – Cornelison 21/9, 2014 at 17:1

@AvinashRaj: This question is about Perl, which certainly supports PCRE – Starknaked 21/9, 2014 at 22:36

@AleksandrDubinsky this blank POSIX notation [[:blank:]] will work on most of the languages. – Cornelison 26/12, 2014 at 4:17

@AvinashRaj: Except that [[:blank:]] doesn't match no-break space -- &nbsp; or "\xA0" – Starknaked 19/1, 2015 at 16:51

Wanna mention that \h worked perfectly for my use case which was doing a find/replace in Notepad++ on 1 or more contiguous non-new-line spaces. Nothing else (simple) worked. – Fari 10/3, 2015 at 20:35

ICU has \h so this is pretty standard. – Dail 13/12, 2015 at 9:9

@Starknaked POSIX blank should match NO-BREAK SPACE in any engine that supports Unicode regular expressions. It is defined in Annex C: Compatibility Properties of Unicode Regular Expressions – Longsighted 3/2, 2016 at 17:53

What makes Perl's \h slightly non-standard is its inclusion of MONGOLIAN VOWEL SEPARATOR. Unicode does not consider it whitespace. For that reason, Perl \h differs from POSIX blank ([[:blank:]] in Perl, \p{Blank} in Java) and Java 8 \h. Admittedly, it's an edge case. – Longsighted 3/2, 2016 at 18:7

For more information on what Unicode considers whitespace (and what it doesn't), see the table in en.wikipedia.org/wiki/White-space_character – Longsighted 3/2, 2016 at 18:8

A table of which regex engines support \h and POSIX blank: regular-expressions.info/refcharclass.html – Longsighted 3/2, 2016 at 19:24

@AleksandrDubinsky: It looks like Perl has been fixed with regard to including MONGOLIAN VOWEL SEPARATOR in \h. See my revised solution above. I can't see it mentioned in any of the fixes so I can't offer a safe version number, but I will keep looking – Starknaked 6/7, 2016 at 12:10

Why do I get Unrecognized escape \h passed through when I try to use this? – Ximenes 28/8, 2018 at 14:15

I'm on Perl 5.16.3, regarding Unrecognized escape \h passed through. Why? – Ximenes 28/8, 2018 at 14:24

In atom editor, \h+ can match spaces correctly, but I cannot replace it with a comma for example somehow. Used [^\S\r\n]+ from the answer below eventually. – Glyceride 29/3, 2019 at 10:53

bad escape \h on python :( – Sturdivant 15/12, 2021 at 20:39

@Glyceride I checked it in PCRE regex and it works great, check this demo, scroll down to see the substitution. – Evania 26/1 at 14:31


Summary

Use \h to match horizontal whitespace, available in Perl since v5.10.0 (released in 2007)
For non-PCRE engines, use a double-negative: [^\S\r\n]
Unicode properties: \p{Blank} or \p{HorizSpace}
Be direct, in ASCII: [\t\f\cK ]
Be direct, in Unicode (but don’t, really)
Other applications of double-negatives and Unicode properties

Horizontal Whitespace

The “Character Classes and other Special Escapes” section of perlre includes

\h Horizontal whitespace

\H Not horizontal whitespace

Double-Negative

If you might use your pattern with other engines, particularly ones that are not Perl-compatible or otherwise don’t support \h, express it as a double-negative:

[^\S\r\n]

That is, match any character that is none of the following: non-whitespace (the capital S complements \s), carriage return, or newline. Distributing the outer not (i.e., the complementing ^ in the bracketed character class) with De Morgan’s law, this becomes whitespace and not carriage return and not newline, which is equivalent to subtracting \r and \n from \s. Including both carriage return and newline in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CRLF) newline conventions.

No need to take my word for it:

#! /usr/bin/env perl

use strict;
use warnings;

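# Whitespace, minus carriage return and newline.
my $ws_not_crlf = qr/[^\S\r\n]/;

# The escapes are single-quoted so they print literally; eval of the
# double-quoted form below turns each one into the actual character.
for (' ', '\f', '\t', '\r', '\n') {
  my $qq = qq["$_"];
  printf "%-4s => %s\n", $qq,
    (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}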

Output:

" "  => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match

Note that vertical tab is missing from the test. Before Perl v5.18, \s did not match it, so neither did this pattern; from v5.18 on, both do.

Before you object too harshly, note that the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads:

Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.
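Because the double-negative is aimed at engines without \h, here is a minimal sketch of the same check in Python, whose re module rejects \h as a bad escape. The names WS_NOT_CRLF and classify are illustrative, not from the original answer. One caveat: for str patterns Python's \s is Unicode-aware, so this class also matches vertical characters such as \v and U+2028; it excludes only \r and \n.

```python
import re

# Whitespace minus CR and LF: the double-negative class from the text.
WS_NOT_CRLF = re.compile(r"[^\S\r\n]")


def classify(ch: str) -> str:
    """Report whether a single character matches the double-negative class."""
    return "match" if WS_NOT_CRLF.fullmatch(ch) else "no match"


if __name__ == "__main__":
    # Same five characters as the Perl test above.
    for ch in [" ", "\f", "\t", "\r", "\n"]:
        print(f"{ch!r:6} => {classify(ch)}")
```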

Unicode Properties

The aforementioned perlre documentation on \h and \H references the perlunicode documentation where we read about a family of useful Unicode properties.

\p{Blank}

This is the same as \h and \p{HorizSpace}: A character that changes the spacing horizontally.

\p{HorizSpace}

This is the same as \h and \p{Blank}: a character that changes the spacing horizontally.

The Direct Approach: ASCII Edition

The “Whitespace” section of perlrecharclass also suggests other approaches that won’t offend grammar instructors who object to double-negatives.

Say what you want rather than what you don’t.

Outside locale and Unicode rules or when the /a or /aa switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.”

To match whitespace but not newlines (broadly), discard \r and \n to leave

[\t\f\cK ]
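As a rough cross-check outside Perl, the sketch below compares the explicit class with the double-negative under ASCII rules in Python. It assumes Python's re module, which has no \cK escape, so vertical tab is written \x0b; DIRECT, DOUBLE_NEG, and matched_codepoints are hypothetical names chosen for the illustration.

```python
import re

# Explicit ASCII class: tab, form feed, vertical tab (\x0b, Perl's \cK), space.
DIRECT = re.compile(r"[\t\f\x0b ]")

# The double-negative, restricted to ASCII rules for \S via re.ASCII.
DOUBLE_NEG = re.compile(r"[^\S\r\n]", re.ASCII)


def matched_codepoints(pattern, limit=0x80):
    """Return the set of code points below `limit` that the pattern matches."""
    return {cp for cp in range(limit) if pattern.fullmatch(chr(cp))}


if __name__ == "__main__":
    print(sorted(matched_codepoints(DIRECT)))
    print(matched_codepoints(DIRECT) == matched_codepoints(DOUBLE_NEG))
```

Under these assumptions both patterns select the same four ASCII characters, which is the point of saying what you want directly.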

The Direct Approach: Unicode Edition

If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the “Whitespace” section of perlrecharclass.

sub ws_not_nl {
  local($_) = <<'EOTable';
0x0009        CHARACTER TABULATION   h s
0x000a              LINE FEED (LF)    vs
0x000b             LINE TABULATION    vs  [1]
0x000c              FORM FEED (FF)    vs
0x000d        CARRIAGE RETURN (CR)    vs
0x0020                       SPACE   h s
0x0085             NEXT LINE (NEL)    vs  [2]
0x00a0              NO-BREAK SPACE   h s  [2]
0x1680            OGHAM SPACE MARK   h s
0x2000                     EN QUAD   h s
0x2001                     EM QUAD   h s
0x2002                    EN SPACE   h s
0x2003                    EM SPACE   h s
0x2004          THREE-PER-EM SPACE   h s
0x2005           FOUR-PER-EM SPACE   h s
0x2006            SIX-PER-EM SPACE   h s
0x2007                FIGURE SPACE   h s
0x2008           PUNCTUATION SPACE   h s
0x2009                  THIN SPACE   h s
0x200a                  HAIR SPACE   h s
0x2028              LINE SEPARATOR    vs
0x2029         PARAGRAPH SEPARATOR    vs
0x202f       NARROW NO-BREAK SPACE   h s
0x205f   MEDIUM MATHEMATICAL SPACE   h s
0x3000           IDEOGRAPHIC SPACE   h s
EOTable

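  my $class;
  # Capture the code point and the name, including a parenthesized
  # abbreviation such as (LF) so the filter below can see it.
  while (/^0x([0-9a-f]{4})\s+([A-Z\s()-]+)/mg) {
    my($hex,$name) = ($1,$2);
    # Skip the newline-like characters: LF, CR, NEL, and the two separators.
    next if $name =~ /\b(?:LF|CR|NEL|SEPARATOR)\b/;
    $class .= "\\N{U+$hex}";
  }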

  qr/[$class]/u;
}

The above is for completeness. Use the Unicode properties rather than writing it out longhand.
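For engines that offer neither \h nor Unicode properties, spelling the class out may be the only option. The following is a sketch in Python (whose standard re module lacks both), assuming the eighteen horizontal code points listed for \h earlier in this thread; HORIZONTAL_CODEPOINTS, H_SPACE, H_RUN, and collapse are illustrative names.

```python
import re

# The eighteen horizontal whitespace code points listed for \h in this thread.
HORIZONTAL_CODEPOINTS = [
    0x0009, 0x0020, 0x00A0, 0x1680,
    *range(0x2000, 0x200B),  # EN QUAD through HAIR SPACE
    0x202F, 0x205F, 0x3000,
]

# Build the class from \uXXXX escapes so no invisible characters
# end up in the pattern source.
_CLASS = "[" + "".join(f"\\u{cp:04x}" for cp in HORIZONTAL_CODEPOINTS) + "]"
H_SPACE = re.compile(_CLASS)
H_RUN = re.compile(_CLASS + "+")


def collapse(text: str) -> str:
    """Replace each run of horizontal whitespace with one space, keeping newlines."""
    return H_RUN.sub(" ", text)


if __name__ == "__main__":
    print(repr(collapse("a \t\u00a0b\nc")))
```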

Other Applications

The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches “word characters”: alphabetic characters, digits, and underscore. We ugly-Americans sometimes want to write it as, say,

if (/[A-Za-z]+/) { ... }

but a double-negative character-class can respect the locale:

if (/[^\W\d_]+/) { ... }

Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character-class communicates the intent more directly

if (/[[:alpha:]]+/) { ... }

or with a Unicode property as szbalint suggested

if (/\p{Letter}+/) { ... }
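The same double-negative carries over to engines without POSIX classes or \p{...}. Here is a small sketch in Python, whose standard re module supports neither; ASCII_LETTERS, LETTERS, and the sample string are made up for the illustration.

```python
import re

# ASCII-only letters versus "word characters minus digits and underscore".
ASCII_LETTERS = re.compile(r"[A-Za-z]+")
LETTERS = re.compile(r"[^\W\d_]+")

# Sample text with accented letters, an underscore, and digits.
SAMPLE = "na\u00efve_caf\u00e9 42 x1"

if __name__ == "__main__":
    print(ASCII_LETTERS.findall(SAMPLE))  # accented letters split the words
    print(LETTERS.findall(SAMPLE))        # accented letters stay inside the words
```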

Pingui asked about nesting the double-negative character class to effectively modify the \s in

/(\+|0|\()[\d()\s-]{6,20}\d/g

The best I could come up with is to use | for an alternative and move the \s to the other branch:

/(\+|0|\()(?:[\d()-]|[^\S\r\n]){6,20}\d/g
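To see what the rewrite buys, the sketch below runs both patterns in Python against an invented phone-like string split across two lines. ORIGINAL and FIXED are illustrative names, and the sample numbers are not from the question.

```python
import re

# Original pattern: \s inside the class lets a match run across a newline.
ORIGINAL = re.compile(r"(\+|0|\()[\d()\s-]{6,20}\d")

# Rewritten pattern: whitespace moves to its own branch and excludes CR and LF.
FIXED = re.compile(r"(\+|0|\()(?:[\d()-]|[^\S\r\n]){6,20}\d")

ONE_LINE = "+1 (555) 123-4567"
TWO_LINES = "+1 (555)\n123-4567"

if __name__ == "__main__":
    for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
        for text in (ONE_LINE, TWO_LINES):
            found = pattern.search(text)
            print(name, repr(text), "=>", repr(found.group(0)) if found else None)
```

Both patterns accept the single-line number; only the original one matches when the number is broken by a newline.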

Yseulte answered 12/8, 2010 at 15:7 Comment(7)

Clever, but the behavior is very surprising, and I don't see how it's less awkward. – Ninette 12/8, 2010 at 16:4

@Qwertie: what's surprising? Less awkward than what? – Poteen 12/8, 2010 at 16:6

How can I nest this expression within another one? E.g. replace "\s" with it in /(\+|0|\()[\d()\s-]{6,20}\d/g? Thx – Stamford 17/8, 2014 at 17:52

In Python, make sure you use this with flags=re.UNICODE. – Ophicleide 10/6, 2019 at 4:17

VSCode Find and Replace doesn't support \h probably because it's something other than PCRE, but this [nice] answer worked for me, thanks. – Slapstick 5/4, 2020 at 18:13

If anyone is using VBScript.regexp and is fed a weird mix of spaces, you may wish to list out all spaces instead with /[\x09\x20\xA0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000]/, as the proposed solution, which should work, didn't work for me – Alkyl 12/10, 2022 at 2:56

The double negative is useful in JavaScript. – Burgas 4/3 at 6:48
