I sometimes want to match whitespace but not newline.
So far I've been resorting to [ \t]. Is there a less awkward way?
Perl versions 5.10 and later support subsidiary vertical and horizontal character classes, \v and \h, as well as the generic whitespace character class \s.
The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters:
U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (matched by \s only when Unicode rules are in effect)
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
The vertical space pattern \v is less useful, but matches these characters:
U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (matched by \s only when Unicode rules are in effect)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
There are seven vertical whitespace characters which match \v and eighteen horizontal ones which match \h. In v5.18 and later, \s matches twenty-three of them when Unicode rules are not in effect.
All whitespace characters are either vertical or horizontal with no overlap, but \h and \v are not subsets of \s in that case, because \h also matches U+00A0 NO-BREAK SPACE, and \v also matches U+0085 NEXT LINE, neither of which is matched by \s unless Unicode rules are in effect.
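As a quick check of the lists above, here is a minimal Perl sketch that classifies a few sample code points with \h and \v. The helper name classify_ws and the choice of sample characters are illustrative assumptions, not part of the answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: classify a one-character string as horizontal
# whitespace, vertical whitespace, or neither (\h and \v need Perl 5.10+).
sub classify_ws {
    my ($char) = @_;
    return 'horizontal' if $char =~ /\A\h\z/;
    return 'vertical'   if $char =~ /\A\v\z/;
    return 'neither';
}

# Sample code points: tab, space, NBSP, LF, CR, NEL, EM SPACE,
# LINE SEPARATOR, and the letter A as a control.
for my $cp (0x09, 0x20, 0xA0, 0x0A, 0x0D, 0x85, 0x2003, 0x2028, 0x41) {
    printf "U+%04X => %s\n", $cp, classify_ws(chr $cp);
}
```

Only the code point numbers are printed, never the characters themselves, so the script needs no output encoding setup.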
\h works only in languages that support PCRE. – Cornelison
[[:blank:]] will work in most languages. – Cornelison
[[:blank:]] doesn't match no-break space ("\xA0"). – Starknaked
\h worked perfectly for my use case, which was doing a find/replace in Notepad++ on 1 or more contiguous non-new-line spaces. Nothing else (simple) worked. – Fari
blank should match NO-BREAK SPACE in any engine that supports Unicode regular expressions. It is defined in Annex C: Compatibility Properties of Unicode Regular Expressions. – Longsighted
Perl's \h is slightly non-standard in its inclusion of MONGOLIAN VOWEL SEPARATOR. Unicode does not consider it whitespace. For that reason, Perl \h differs from POSIX blank ([[:blank:]] in Perl, \p{Blank} in Java) and Java 8 \h. Admittedly, it's an edge case. – Longsighted
Reference for \h and POSIX blank: regular-expressions.info/refcharclass.html – Longsighted
On MONGOLIAN VOWEL SEPARATOR in \h: see my revised solution above. I can't see it mentioned in any of the fixes so I can't offer a safe version number, but I will keep looking. – Starknaked
I get "Unrecognized escape \h passed through" when I try to use this. Why? – Ximenes
bad escape \h on Python :( – Sturdivant
Any of the following will do:
\h to match horizontal whitespace, in Perl since v5.10.0 (released in 2007)
[^\S\r\n]
\p{Blank} or \p{HorizSpace}
[\t\f\cK ]
The “Character Classes and other Special Escapes” section of perlre includes
\h  Horizontal whitespace
\H  Not horizontal whitespace
If you might use your pattern with other engines, particularly ones that are not Perl-compatible or otherwise don’t support \h, express it as a double-negative:
[^\S\r\n]
That is, not any of the following: not-whitespace (the capital S complements), carriage return, or newline. Distributing the outer not (i.e., the complementing ^ in the bracketed character class) with De Morgan’s law gives whitespace and not carriage return and not newline, which is equivalent to subtracting \r and \n from \s. Including both carriage return and newline in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CRLF) newline conventions.
No need to take my word for it:
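#! /usr/bin/env perl
use strict;
use warnings;

# Whitespace that is neither carriage return nor newline.
my $ws_not_crlf = qr/[^\S\r\n]/;

for (' ', '\f', '\t', '\r', '\n') {
  # Build a double-quoted literal so that eval turns '\t' into a real tab.
  my $qq = qq["$_"];
  printf "%-4s => %s\n", $qq,
    (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}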
Output:
" "  => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match
Note the exclusion of vertical tab: before v5.18, \s did not match it, so neither did this pattern. This is addressed in v5.18.
Before you object too harshly, note that the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads
Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.
The aforementioned perlre documentation on \h and \H references the perlunicode documentation where we read about a family of useful Unicode properties.
\p{Blank} - This is the same as \h and \p{HorizSpace}: A character that changes the spacing horizontally.
\p{HorizSpace} - This is the same as \h and \p{Blank}: a character that changes the spacing horizontally.
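To see the equivalence quoted above on your own Perl, a small sketch like the following compares two patterns character by character. The helper name class_differences and the upper bound of U+3000 (the highest code point in the whitespace lists on this page) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points in 0..$max on which two
# patterns give different answers for a one-character string.
sub class_differences {
    my ($re_a, $re_b, $max) = @_;
    my @diff;
    for my $cp (0 .. $max) {
        my $ch   = chr $cp;
        my $in_a = $ch =~ $re_a ? 1 : 0;
        my $in_b = $ch =~ $re_b ? 1 : 0;
        push @diff, sprintf('U+%04X', $cp) if $in_a != $in_b;
    }
    return @diff;
}

my @mismatch = class_differences(qr/\h/, qr/\p{Blank}/, 0x3000);
print @mismatch
    ? "differ at: @mismatch\n"
    : "\\h and \\p{Blank} agree up to U+3000\n";
```

An empty result only shows agreement within the tested range; it is a sanity check, not a proof for all of Unicode.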
The “Whitespace” section of perlrecharclass also suggests other approaches that won’t offend grammar instructors’ opposition to double-negatives.
Say what you want rather than what you don’t.
Outside locale and Unicode rules or when the /a or /aa switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.”
To match whitespace but not newlines (broadly), discard \r and \n to leave
[\t\f\cK ]
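One place the spelled-out class is convenient is stripping trailing whitespace while leaving the line endings alone. The sketch below is only an illustration under that assumption; the sub name strip_trailing and the sample text are made up, and the class is ASCII-only by construction.

```perl
use strict;
use warnings;

# Hypothetical helper: remove runs of tab, form feed, vertical tab, or
# space that sit directly before a line ending (LF or CRLF) or at the
# very end of the text. The line endings themselves are left untouched.
sub strip_trailing {
    my ($text) = @_;
    $text =~ s/[\t\f\cK ]+(?=\r?\n|\z)//g;
    return $text;
}

my $cleaned = strip_trailing("a \t\r\nb  \n c\t");
(my $shown = $cleaned) =~ s/\r/\\r/g;   # make CR visible
$shown =~ s/\n/\\n/g;                   # make LF visible
print "$shown\n";
```

The lookahead is used instead of $ with /m so that whitespace in front of a CRLF pair is also removed.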
If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the “Whitespace” section of perlrecharclass.
sub ws_not_nl {
  local($_) = <<'EOTable';
0x0009 CHARACTER TABULATION h s
0x000a LINE FEED (LF) vs
0x000b LINE TABULATION vs [1]
0x000c FORM FEED (FF) vs
0x000d CARRIAGE RETURN (CR) vs
0x0020 SPACE h s
0x0085 NEXT LINE (NEL) vs [2]
0x00a0 NO-BREAK SPACE h s [2]
0x1680 OGHAM SPACE MARK h s
0x2000 EN QUAD h s
0x2001 EM QUAD h s
0x2002 EN SPACE h s
0x2003 EM SPACE h s
0x2004 THREE-PER-EM SPACE h s
0x2005 FOUR-PER-EM SPACE h s
0x2006 SIX-PER-EM SPACE h s
0x2007 FIGURE SPACE h s
0x2008 PUNCTUATION SPACE h s
0x2009 THIN SPACE h s
0x200a HAIR SPACE h s
0x2028 LINE SEPARATOR vs
0x2029 PARAGRAPH SEPARATOR vs
0x202f NARROW NO-BREAK SPACE h s
0x205f MEDIUM MATHEMATICAL SPACE h s
0x3000 IDEOGRAPHIC SPACE h s
EOTable
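  my $class = '';
  # Capture the code point and the rest of the row, so that the
  # parenthesized abbreviations such as (LF) and (CR) are visible below.
  while (/^0x([0-9a-f]{4})\s+(.*)$/mg) {
    my($hex,$name) = ($1,$2);
    # Skip the newline-like characters; keep every other whitespace row.
    next if $name =~ /\b(?:LF|CR|NEL|SEPARATOR)\b/;
    $class .= "\\N{U+$hex}";
  }
  qr/[$class]/u;
}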
The above is for completeness. Use the Unicode properties rather than writing it out longhand.
The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches “word characters”: alphabetic characters, digits, and underscore. We ugly-Americans sometimes want to write it as, say,
if (/[A-Za-z]+/) { ... }
but a double-negative character-class can respect the locale:
if (/[^\W\d_]+/) { ... }
Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character-class communicates the intent more directly
if (/[[:alpha:]]+/) { ... }
or with a Unicode property as szbalint suggested
if (/\p{Letter}+/) { ... }
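The four spellings above can be compared on a string with one accented letter. This is a rough sketch: the helper runs_of and the sample text are invented for the comparison, and the /u flag (Perl 5.14 or later) is an assumption used here in place of a locale to make the non-ASCII letter count.

```perl
use strict;
use warnings;

# Hypothetical helper: return every maximal run of characters matching
# the single-character pattern $re in $text.
sub runs_of {
    my ($re, $text) = @_;
    return $text =~ /((?:$re)+)/g;
}

my $text = "caf\x{e9} 42 foo_bar";   # \x{e9} is LATIN SMALL LETTER E WITH ACUTE

my %patterns = (
    'ascii'      => qr/[A-Za-z]/,
    'double-neg' => qr/[^\W\d_]/u,
    'posix'      => qr/[[:alpha:]]/u,
    'property'   => qr/\p{Letter}/,
);

for my $name (sort keys %patterns) {
    my @runs = runs_of($patterns{$name}, $text);
    # Show non-ASCII characters as \x{..} so the output is terminal-safe.
    s/([^\x20-\x7E])/sprintf('\\x{%x}', ord $1)/ge for @runs;
    printf "%-10s => %s\n", $name, join(' ', @runs);
}
```

The classes agree on this sample apart from the ASCII-only one, but they are not identical sets in general, so treat the comparison as a spot check.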
Pingui asked about nesting the double-negative character class to effectively modify the \s in
/(\+|0|\()[\d()\s-]{6,20}\d/g
The best I could come up with is to use | for an alternative and move the \s to the other branch:
/(\+|0|\()(?:[\d()-]|[^\S\r\n]){6,20}\d/g
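To see what the change buys, the sketch below runs both patterns over a made-up sample in which two digit groups are separated by a newline. The helper all_matches and the sample string are assumptions for illustration only.

```perl
use strict;
use warnings;

# The two patterns from the text, without /g so they can be reused.
my $with_s       = qr/(\+|0|\()[\d()\s-]{6,20}\d/;
my $without_crlf = qr/(\+|0|\()(?:[\d()-]|[^\S\r\n]){6,20}\d/;

# Hypothetical helper: return the full text of every match of $re in $text.
sub all_matches {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, substr($text, $-[0], $+[0] - $-[0]);
    }
    return @found;
}

# Made-up sample: two digit groups separated by a newline.
my $sample = "0123 456\n789 012";

for my $case ([ 'with \s' => $with_s ], [ 'without CR/LF' => $without_crlf ]) {
    my ($label, $re) = @$case;
    for my $match (all_matches($re, $sample)) {
        (my $shown = $match) =~ s/\n/\\n/g;   # make the newline visible
        print "$label: $shown\n";
    }
}
```

With \s the match runs across the line break and swallows both groups; with the alternation it stops at the end of the first line.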
How can I replace "\s" with it in /(\+|0|\()[\d()\s-]{6,20}\d/g? Thx – Stamford
flags=re.UNICODE. – Ophicleide
I used /[\x09\x20\xA0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000]/ because the proposed solution, which should work, didn't work for me. – Alkyl
A variation on Greg’s answer that includes carriage returns too:
/[^\S\r\n]/
This regex is safer than /[^\S\n]/ with no \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You’re unlikely to find \r without \n nowadays, but if you do find it, it couldn’t mean anything but a newline. Thus, since \r can mean a newline, we should exclude it too.
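The difference shows up as soon as the text has Windows line endings. In this small sketch (the helper squeeze and the sample text are hypothetical), each run of matched whitespace is collapsed to a single space.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each run of characters matching the
# single-character pattern $re with one space.
sub squeeze {
    my ($re, $text) = @_;
    (my $out = $text) =~ s/(?:$re)+/ /g;
    return $out;
}

my $crlf_text = "one \t two  \r\nthree\r\n";

for my $case ([ '[^\S\n]' => qr/[^\S\n]/ ], [ '[^\S\r\n]' => qr/[^\S\r\n]/ ]) {
    my ($label, $re) = @$case;
    my $shown = squeeze($re, $crlf_text);
    $shown =~ s/\r/\\r/g;   # make CR visible
    $shown =~ s/\n/\\n/g;   # make LF visible
    print "$label: $shown\n";
}
```

Without \r in the class, the carriage returns are treated as ordinary whitespace and replaced, which silently turns CRLF line endings into LF.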
The regex below matches whitespace but not a newline character.
(?:(?!\n)\s)
If you also want to exclude carriage return, add \r alongside \n in a character class inside the negative lookahead.
(?:(?![\n\r])\s)
Add + after the non-capturing group to match one or more whitespace characters.
(?:(?![\n\r])\s)+
I don't know why you people failed to mention the POSIX character class [[:blank:]], which matches any horizontal whitespace (spaces and tabs). This POSIX character class works in BRE (Basic Regular Expressions), ERE (Extended Regular Expressions), and PCRE (Perl Compatible Regular Expressions).
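In Perl specifically, what [[:blank:]] matches beyond space and tab depends on the rules in effect. The sketch below assumes Perl 5.14 or later for the /a and /u flags; the sub names are made up for the illustration.

```perl
use strict;
use warnings;

# /a restricts POSIX classes to ASCII; /u applies Unicode rules.
sub blank_ascii   { return $_[0] =~ /\A[[:blank:]]\z/a ? 1 : 0 }
sub blank_unicode { return $_[0] =~ /\A[[:blank:]]\z/u ? 1 : 0 }

my %sample = (
    'TAB'  => "\t",
    'NBSP' => "\xA0",   # U+00A0 NO-BREAK SPACE
    'LF'   => "\n",
);

for my $name (sort keys %sample) {
    printf "%-4s ascii=%d unicode=%d\n",
        $name, blank_ascii($sample{$name}), blank_unicode($sample{$name});
}
```

This may explain why reports about whether [[:blank:]] matches the no-break space differ: under ASCII rules it does not, under Unicode rules it does.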
What you are looking for is the POSIX blank character class. In Perl it is referenced as:
[[:blank:]]
in Java (don't forget to enable UNICODE_CHARACTER_CLASS):
\p{Blank}
Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).
But the problem is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters which are not considered whitespace in Unicode:
U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE
Taken from https://en.wikipedia.org/wiki/White-space_character
The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with U+200C and U+200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. They're more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it is used as something other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.
In Java:
public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
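Since the original question is tagged perl, here is a rough Perl counterpart of the same idea: \p{Blank} extended with ZERO WIDTH SPACE (U+200B), WORD JOINER (U+2060), and ZERO WIDTH NON-BREAKING SPACE (U+FEFF). The variable and sub names are assumptions for illustration, not part of the answer.

```perl
use strict;
use warnings;

# Assumed Perl counterpart of the Java constant: \p{Blank} plus
# U+200B ZERO WIDTH SPACE, U+2060 WORD JOINER, and
# U+FEFF ZERO WIDTH NON-BREAKING SPACE.
my $HORIZONTAL_WHITESPACE = qr/[\p{Blank}\x{200B}\x{2060}\x{FEFF}]/;

# Hypothetical helper: true if the one-character string is in the class.
sub is_horizontal_ws {
    my ($char) = @_;
    return $char =~ /\A$HORIZONTAL_WHITESPACE\z/ ? 1 : 0;
}

for my $cp (0x20, 0xA0, 0x200B, 0x200D, 0x2060, 0xFEFF, 0x0A) {
    printf "U+%04X => %s\n", $cp,
        is_horizontal_ws(chr $cp) ? 'match' : 'no match';
}
```

ZERO WIDTH JOINER (U+200D) is deliberately left out, following the reasoning that it behaves more like a modifier than a separator.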
Note the perl tag in the original question. – Longsighted
[\p{Blank}\u200b\u180e] are required. Admittedly, it makes sense that a vowel separator is not considered a whitespace character, but why zero-width space is not in classes like \s and \p{Blank} beats me. – Procedure
Put the regex below in the find section and select Regular Expression from "Search Mode":
[^\S\r\n]+
You probably want \h, as others have pointed out. However, Perl v5.18 and later supports regex set operations as part of its Unicode support. If you want most of something, it may be easier to subtract out the few things you don't want.
Suppose that you'll accept any whitespace except for exactly the newline. You don't care about carriage returns, form feeds, or vertical tabs. This regex set operation creates a character class by starting with all whitespace and removing the newline:
use v5.18;
/(?[ [\s] - [\n] ])/;
Here's another one. Suppose you want all the Latin letters except for vowels. You could write that out with the omissions and hope you don't make a mistake:
/[b-df-hj-np-tv-z]/;
It's easier when the code cleanly shows what you are doing:
use v5.18;
/(?[ [a-z] - [aeiou] ])/;
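Both set operations can be tried in one small script. This is a sketch under two assumptions: the sub names are made up, and the experimental::regex_sets warning category is switched off because, as far as I know, Perl versions before 5.36 warn that the construct is experimental.

```perl
use v5.18;
use warnings;
# (?[ ... ]) warned as experimental before Perl 5.36; silence that category.
no warnings 'experimental::regex_sets';

# All whitespace except exactly the newline.
sub is_ws_not_newline { return $_[0] =~ /\A(?[ [\s] - [\n] ])\z/ ? 1 : 0 }

# Lowercase ASCII letters with the vowels subtracted out.
sub is_consonant { return $_[0] =~ /\A(?[ [a-z] - [aeiou] ])\z/ ? 1 : 0 }

printf "tab: %d, CR: %d, LF: %d\n",
    is_ws_not_newline("\t"), is_ws_not_newline("\r"), is_ws_not_newline("\n");
printf "b: %d, e: %d\n", is_consonant('b'), is_consonant('e');
```

Note that the carriage return still matches the first set, as the answer says; subtract [\r\n] instead of [\n] if that is not what you want.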
m/ /g
Just put a literal space between the slashes, / /, and it will match a space character. By contrast, \s matches all the whitespace characters, such as tabs, newlines, and spaces.
[\r\f]. – Mezzotint