Bash Regular Expression -- Can't seem to match any of \s \S \d \D \w \W etc
I have a script that is trying to get blocks of information from gparted.

My Data looks like:

Disk /dev/sda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 2      316MB   38.7GB  38.4GB  primary  ext4
 3      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)

log4net.xml
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 5      316MB   38.7GB  38.4GB  primary  ext4
 6      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)

I use a regex to break this into two Disk blocks

^Disk (/dev[\S]+):((?!Disk)[\s\S])*

This works with multiline on.

When I test this in a bash script, I can't seem to match \s, or \S -- What am I doing wrong?

I am testing this through a script like:

data=`cat disks.txt`
morematches=1
x=0
regex="^Disk (/dev[\S]+):((?!Disk)[\s\S])*"

if [[ $data =~ $regex ]]; then
    echo "Matched"
    while [ $morematches == 1 ]; do
        x=$[x+1]
        if [[ ${BASH_REMATCH[x]} != "" ]]; then
            echo $x "matched" ${BASH_REMATCH[x]}
        else
            echo $x "Did not match"
            morematches=0
        fi
    done
fi

However, when I walk through testing parts of the regex, whenever I match a \s or \S, it doesn't work -- what am I doing wrong?

Waynant answered 29/8, 2013 at 14:44 Comment(7)
Apparently so.. I guess every other regex engine I've used has been using the perl conventionsWaynant
\s and \S are PCRE extensions; they are not present in the ERE (Posix Extended Regular Expression) standard. Just be glad you aren't trying to use BRE.Upanchor
...by the way, a lot of the PCRE extensions are poorly-thought-out things with absolutely horrid worst-case performance (particularly, lookahead/lookbehind). Choosing to use ERE instead is, as a rule, very much defensible.Upanchor
...see in particular swtch.com/~rsc/regexp/regexp1.htmlUpanchor
...kibitzing on some other points: x=$[x+1] is an antique syntax; ((x++)) is the modern bash version, or x=$((x + 1)) the modern POSIX version. Using == inside of [ ] is not POSIX-compliant; either use [[ ]] (which doesn't try to be POSIX compliant, and allows you to not quote by virtue of having parse-time rules that turn off string-splitting) or use = instead of == (and make it [ "$morematches" = 1 ], WITH THE QUOTES!). Always quote your expansions: echo "$x did not match"; otherwise, globs inside of $x are expanded and runs of whitespace compressed.Upanchor
@Waynant Your script is actually confusing to what it really wants to do. Do you want to have a message like /dev/xyz matched 4.9GB?Intrude
Konsole: This was just to test the regex. I have a larger, irrelevant script that does something with /dev/sda1, /dev/sda2, etc. based on its file system typeWaynant
32

Perhaps \s and \S are not supported, or perhaps they cannot be used inside [ ]. Try the following regex instead:

^Disk[[:space:]]+/dev[^[:space:]]+:[[:space:]]+[^[:space:]]+

EDIT

It seems like you actually want to get the matching fields. I simplified the script to this for that.

#!/bin/bash 

regex='^Disk[[:space:]]+(/dev[^[:space:]]+):[[:space:]]+(.*)'

while IFS= read -r line; do
    [[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]} matches ${BASH_REMATCH[2]}."
done < disks.txt

Produces:

/dev/sda matches 42.9GB.
/dev/sdb matches 42.9GB.
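
If the goal is the per-disk blocks from the question, the lookahead can be avoided by reading line by line and remembering the most recent Disk header. The following is only a sketch under that assumption: parse_disks is a hypothetical helper name, and the field layout is taken from the sample data in the question.

```bash
#!/bin/bash
# Sketch: attach each partition row to the most recent "Disk" header.
# parse_disks is a hypothetical helper; it reads parted-style text on stdin.
parse_disks() {
    local disk_re='^Disk[[:space:]]+(/dev[^[:space:]:]+):'
    local num_re='^[[:digit:]]+$'
    local line disk='' num start end size ptype fs flags
    while IFS= read -r line; do
        if [[ $line =~ $disk_re ]]; then
            # A new block starts here: remember the device name.
            disk=${BASH_REMATCH[1]}
        else
            # Split the row on whitespace; partition rows start with a number.
            read -r num start end size ptype fs flags <<< "$line"
            if [[ -n $disk && $num =~ $num_re ]]; then
                echo "$disk$num $fs"
            fi
        fi
    done
}

parse_disks <<'EOF'
Disk /dev/sda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 2      316MB   38.7GB  38.4GB  primary  ext4
 3      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)
EOF
```

With the sample above this prints /dev/sda1 ext4, /dev/sda2 ext4 and /dev/sda3 linux-swap(v1), one per line. Keeping the state in a variable takes the place of the ((?!Disk)[\s\S])* part of the original regex.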
Intrude answered 29/8, 2013 at 14:49 Comment(2)
[[:alnum:]] and [[:digit:]] would probably be better than the "^space" constructs (even though those match what the OP asked for).Dube
@Dube Yes it could be an option too :)Intrude
21

Because this is a common FAQ, let me list a few constructs which are not supported in Bash (and related tools like sed, grep, etc), and how to work around them, where there is a simple workaround.

There are multiple dialects of regular expressions in common use. The one supported by Bash is a variant of Extended Regular Expressions. This is different from e.g. what many online regex testers support, which is often the more modern Perl 5 / PCRE variant.

Bash doesn't support:

  • \d \D \s \S \w \W -- these can be replaced with POSIX character class equivalents [[:digit:]], [^[:digit:]], [[:space:]], [^[:space:]], [_[:alnum:]], and [^_[:alnum:]], respectively. (Notice the last case, where the [:alnum:] POSIX character class is augmented with underscore to be exactly equivalent to the Perl \w shorthand.)
  • Non-greedy matching. You can sometimes replace a.*?b with something like a[^ab]*b to get a similar effect in practice, though the two are not exactly equivalent.
  • Non-capturing parentheses (?:...). In the trivial case, just use capturing parentheses (...) instead; though of course, if you use capture groups and/or backreferences, this will renumber your capture groups.
  • Lookarounds like (?<=before) or (?!after). (In fact anything with (? is a Perl extension.) There is no simple general workaround for these, though you can sometimes rephrase your problem into one where lookarounds can be avoided.
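
The replacements listed above can be tried out with a short script. This is a minimal sketch, not a complete translation table; the variable names are made up for illustration, and the patterns are kept in variables so the shell does not interpret any regex characters.

```bash
#!/bin/bash
# Perl shorthand rewritten as POSIX bracket expressions.
digits='^[[:digit:]]+$'      # Perl: ^\d+$
word='^[_[:alnum:]]+$'       # Perl: ^\w+$
has_space='[[:space:]]'      # Perl: \s
lazy='a[^ab]*b'              # rough stand-in for the non-greedy a.*?b

[[ 12345 =~ $digits ]]     && echo "digits ok"
[[ foo_bar9 =~ $word ]]    && echo "word ok"
[[ 'a b' =~ $has_space ]]  && echo "space ok"

# The negated class stops at the first b, like a.*?b would.
if [[ 'xaYYbZZb' =~ $lazy ]]; then
    echo "lazy-like match: ${BASH_REMATCH[0]}"
fi
```

The last test prints aYYb rather than aYYbZZb, because [^ab]* cannot run past the first b.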
Shechem answered 21/2, 2018 at 5:44 Comment(3)
#19454491 has some ideas for how to reimplement lookarounds.Shechem
Perhaps tangentially see also Why are there so many different regular expression dialects?Shechem
Bash does support \s and others in certain cases, see my answer below.Pastore
4

from man bash

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

ERE doesn't support look-ahead or look-behind. However, your regex uses a look-ahead: (?!Disk).

That's why your regex doesn't match as you expected.

Draggletailed answered 29/8, 2013 at 14:53 Comment(5)
That, plus the lack of \s and \S.Issi
@AdrianFrühwirth \s and \S should be ok, see my answer, I added that section.Draggletailed
\s and \S may work in practice, but the bash documentation does not promise that they'll work -- only the ERE syntax parsed by regex(3) is guaranteed to be supported, and the POSIX ERE standard does not include these shortcuts. Relying on them is thus... unfortunate and fragile.Upanchor
Charles is right... on my system I get: [[ "aaa" =~ "\S+" ]] && echo "yes" || echo "no" --> noCarney
@CharlesDuffy sorry, I was testing it in zsh...... you are right, I am removing the \s \S part.Draggletailed
3

Bash supports what regcomp(3) supports on your system. Glibc's implementation does support \s and others, but due to the way Bash quotes stuff on binary operators, you cannot encode a proper \s directly, no matter what you do:

[[ 'a   b' =~ a[[:space:]]+b ]] && echo ok # OK
[[ 'a   b' =~ a\s+b ]] || echo fail        # Fail
[[ 'a   b' =~ a\\s+b ]] || echo fail       # Fail
[[ 'a   b' =~ a\\\s+b ]] || echo fail      # Fail

It is much simpler to work with a pattern variable for this:

pattern='a\s+b'
[[ 'a   b' =~ $pattern ]] && echo ok # OK
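
Since \s in a pattern variable depends on the C library, a script that has to run on several systems could probe for it first and fall back to the POSIX class. This is only a sketch of that idea; the variable names probe and ws are made up for illustration.

```bash
#!/bin/bash
# Probe whether the local regex library treats \s as whitespace.
probe='^\s$'
if [[ ' ' =~ $probe ]]; then
    ws='\s'             # extension available (e.g. with glibc)
else
    ws='[[:space:]]'    # portable POSIX character class
fi

pattern="a${ws}+b"
if [[ 'a   b' =~ $pattern ]]; then
    echo "matched using $ws"
fi
```

In practice it is usually simpler to write [[:space:]] everywhere and skip the probe, since the POSIX class works in both cases.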
Pastore answered 31/5, 2022 at 18:51 Comment(1)
This is then obviously only true on systems where Bash was compiled with Glibc. For me, it works out of the box on Ubuntu, but not on MacOS.Shechem
0

Also, [\s\S] is equivalent to ., i.e., any character. On my shell, [^\s] works but not [\S].

Kotz answered 29/8, 2013 at 15:2 Comment(2)
[^\s] doesn't do what you think; in a POSIX bracket expression the backslash is literal, so it just matches a single character which is neither a backslash nor sShechem
For most real-life scenarios, . is equivalent to but more portable and readable than [\s\S]. The latter apparently gets passed along among the superstitious as something which might magically match newlines where . doesn't; but in many contexts where you see it, the tool doesn't allow a regex match to straddle newlines in the first place anyway.Shechem
-1

I know you already "solved" this, but your original issue was probably as simple as not quoting $regex in your test. ie:

if [[ $data =~ "$regex" ]]; then

Bash variable expansion will simply plop in the string, and the space in your original regex will break test because:

regex="^Disk (/dev[\S]+):((?!Disk)[\s\S])*"
if [[ $data =~ $regex ]]; then

is the equivalent of:

if [[ $data =~ ^Disk (/dev[\S]+):((?!Disk)[\s\S])* ]]; then

and bash/test will have a fun time interpreting a bonus argument and all those unquoted meta-characters.

Remember, bash does not pass variables, it expands them.

Borries answered 29/8, 2013 at 19:38 Comment(2)
This was pretty confusing after my 20 minute crash course ;) I ended up just writing a small perl script that I invoke and that was a lot simpler. I hadn't realized that the bash regex conventions were so different, as pretty much everything else I have used supports perl-style.Waynant
This answer isn't actually correct -- [[ has its own parser-level handling; it treats the content on the right-hand side as a literal string if quoted, and a regex if unquoted; it does not perform word-splitting or globbing. This means regex='.+'; [[ $data =~ $regex ]] matches any non-empty string, whereas regex='.+'; [[ $data =~ "$regex" ]] matches only strings that contain the exact text .+ within them.Upanchor
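
The difference described in the comment above can be checked with a few lines. This is a small sketch with made-up sample strings, assuming bash 3.2 or later, where a quoted right-hand side of =~ is matched literally.

```bash
#!/bin/bash
data='abc'
regex='.+'

# Unquoted: the variable's content is used as a regex.
[[ $data =~ $regex ]] && echo "unquoted: regex match"

# Quoted: the content is matched as the literal text ".+".
[[ $data =~ "$regex" ]] || echo "quoted: no literal .+ in '$data'"

# A string that really contains ".+" does match the quoted form.
[[ 'x.+y' =~ "$regex" ]] && echo "quoted: literal match in 'x.+y'"
```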
