Bash regex ungreedy match

I have a regex pattern that is supposed to match at multiple places in a string. I want to get all the match groups into one array and then print every element.

So, I've been trying this:

#!/bin/bash

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'
regex=$'\n\t(.+?)\\s+Disk'
if [[ $f =~ $regex ]]
then
    for match in "${BASH_REMATCH[@]}"
    do
        echo "New match: $match"
    done
else
    echo "No matches"
fi

Result:

New match: 
    Share1   Disk
    Share2  Disk
New match: Share1   Disk
    Share2 

The expected result would have been

New match: Share1
New match: Share2

I think it doesn't work because my .+? is matching greedily. So I looked up how this could be accomplished with bash regex, but everyone seems to suggest using grep with Perl regex.

But surely there has to be another way. I was thinking maybe something like [^\\s]+.. But the output for that was:

New match: 
    Share1   Disk
New match: Share1

... Any ideas?

Bourse answered 14/12, 2016 at 12:44 Comment(7)
One idea would be to use [^\\s]+? instead of .+? . That will match characters until a whitespace is found.Soonsooner
Both produce the same result as [^\\s]+ which I have already mentioned in my question. I don't think that the ? is even supported in bash, I mean in this context.. I mean a ? behind a + usually means match ungreedy.Bourse
Based on this answer POSIX regular expression (which is what is used with =~ operator) does not have non-greedy quantifiers.Shawnna
@ThomasAyoub: Thanks for pointing out. @Forivin: [^\s] is same as \S. Use `\` for escaping if needed.Soonsooner
@Forivin: You should use the first captured group . Something like $match[1] (Not good with bash).Soonsooner
@Soonsooner Well, yes, I escaped it properly, but as I said, it didn't produce the desired result.Bourse
You should split the string with newline first, then iterate the chunks checking each with your regex and grab the value using ${BASH_REMATCH[1]}.Preceding

There are a couple of issues here. First, the first element of BASH_REMATCH is the entire string that matched the pattern, not the capture group, so you want to use ${BASH_REMATCH[@]:1} to get those things that were in the capture groups.
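
To make the layout of BASH_REMATCH concrete, here is a minimal sketch on a single line of the input. The variable names line, whole and groups are only illustrative, and the POSIX classes [^[:space:]] and [[:space:]] are an assumption swapped in for \S and \s, which not every regex library accepts.

```bash
#!/usr/bin/env bash

line=$'\tShare1   Disk'
regex=$'\t([^[:space:]]+)[[:space:]]+Disk'

if [[ $line =~ $regex ]]; then
    whole=${BASH_REMATCH[0]}           # entire matched text, including the tab
    groups=("${BASH_REMATCH[@]:1}")    # capture groups only (index 1 onwards)
    printf 'Whole match: [%s]\n' "$whole"
    printf 'Group: [%s]\n' "${groups[@]}"
fi
```

Here groups ends up with one element, Share1, while whole still carries the leading tab and the trailing Disk.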

However, bash's =~ operator only reports the first match; it has no option to repeat the match across the rest of the string, so bash probably isn't the right tool for this job. Since each entry is on its own line, though, you can split the input on newlines and apply the pattern to each line, like:

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'
regex=$'\t(\S+?)\\s+Disk'
while IFS=$'\n' read -r line; do
    if [[ $line =~ $regex ]]
    then
        printf 'New match: %s\n' "${BASH_REMATCH[@]:1}"
    else
        echo "No matches"
    fi
done <<<"$f"
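
A variation on the loop above, as a sketch only: it collects the captures into an array (which is what the question asked for) and spells the character classes in POSIX bracket form, on the assumption that \S and \s may not be available everywhere. The names shares and share are illustrative.

```bash
#!/usr/bin/env bash

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'
# Anchored per line: tab, a run of non-space characters, whitespace, then Disk.
regex=$'^\t([^[:space:]]+)[[:space:]]+Disk$'
shares=()

# IFS= keeps the leading tab on each line; -r keeps backslashes literal.
while IFS= read -r line; do
    if [[ $line =~ $regex ]]; then
        shares+=("${BASH_REMATCH[1]}")
    fi
done <<<"$f"

for share in "${shares[@]}"; do
    echo "New match: $share"
done
```

Lines that do not match (the empty first line and the Printer line) are simply skipped instead of printing "No matches".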
Mak answered 14/12, 2016 at 13:18 Comment(2)
My mind is blown that bash can actually do array slicing (yes I went to figure what in the world is ${arr[@]:1})Dimension
The ? in (\S+?) has no effect; ERE does not support non-greedy matching at all. Instead, the docs say "The behavior of multiple adjacent duplication symbols ( +, *, ?, and intervals) produces undefined results."Stature

As the accepted answer already states, the solution here is not really to use a non-greedy regex, because Bash doesn't support the notation .*? (it was introduced in Perl 5, and is available in languages whose regex implementation derives from that, but Bash is not one of them). But for visitors finding this question in Google, the answer to the actual question in the title is sometimes to simply use a more limited regex than .* to implement the non-greedy matching you are looking for.

For example,

re='(Disk.*)'
if [[ $f =~ $re ]]; then
  ...  # ${BASH_REMATCH[0]} contains the first occurrence of Disk and everything after it
fi

This is just a building block; you would have to take it from there with additional regex matches or a loop. See below for a non-regex variation which does more or less this.

If the thing you don't want to match is a specific character, using a negated character class is simple, elegant, convenient, and compatible back to the dark beginnings of Ken Thompson's original regular expression library. In the OP's example, it looks like you want to skip over a newline and a tab, then match any characters which are not literal spaces.

re=$'\n\t([^ ]+)'
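
As a rough illustration of the difference, the sketch below runs a greedy pattern and a negated-class pattern against the string from the question. The variable names greedy, limited, greedy_match and limited_match are made up for this example, and the trailing " +Disk" is added here so that both patterns have the same anchor.

```bash
#!/usr/bin/env bash

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'

greedy=$'\n\t(.+) +Disk'       # .+ also matches spaces and newlines, so it runs on
limited=$'\n\t([^ ]+) +Disk'   # [^ ]+ cannot cross a space, so it stops at the first one

[[ $f =~ $greedy ]]  && greedy_match=${BASH_REMATCH[1]}
[[ $f =~ $limited ]] && limited_match=${BASH_REMATCH[1]}

printf 'greedy:  [%s]\n' "$greedy_match"
printf 'limited: [%s]\n' "$limited_match"
```

The limited pattern captures just Share1, while the greedy one spills over into the following line, which is the behaviour the question ran into.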

But probably in this case a better solution is to actually use parameter expansions in a loop.

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'
result=()
f=${f#$'\n\t'}                # trim any newline + tab prefix
while [[ -n $f ]]; do
  line=${f%%$'\n'*}           # look at the current line only
  case $line in
    *\ Disk*)
        result+=("${line%% *}") ;;   # capture up to just before first space
  esac
  case $f in
    *$'\n\t'*) f=${f#*$'\n\t'} ;;    # trim up to next newline + tab
    *) break ;;                      # no further entries; avoids looping forever
  esac
done
echo "${result[@]}"
Perigon answered 20/1, 2021 at 11:10 Comment(1)
See also #18514635 for a broader discussion of how to work around the lack of some PCRE regex features in Bash (and more generally POSIX-style regular expressions).Perigon

I came across a very similar problem and solved it in the manner below.

#!/bin/bash

# Captures all %{...} patterns and stops greedy matching by not matching 
# the } inside using [^}] yet capturing it once outside. 
# It also matches all remaining characters.
 
regex="^[^}]*(%{[^}]+})(.*)"

URL="http://%{host}/%{path1}/%{path2}"

value=$URL
matches=()

while true
do
  if [[ $value =~ $regex ]]
  then
    # Quote the expansion so a match containing spaces stays one array element
    matches+=( "${BASH_REMATCH[1]}" )
    value=${BASH_REMATCH[2]}
    echo "Yes: ${BASH_REMATCH[1]}  ${BASH_REMATCH[2]}"
  else
    break
  fi
done

echo "${matches[@]}"

The output of the above is as follows; the last line is the array of matches:

$ . loop-match.sh
Yes: %{host}  /%{path1}/%{path2}
Yes: %{path1}  /%{path2}
Yes: %{path2}

%{host} %{path1} %{path2}
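
The same match-then-continue-with-the-remainder loop can be sketched for the share list from the original question. This is only an illustration under a few assumptions: the pattern, the three-group layout, and the names rest, name, kind and disks are invented here, and POSIX bracket classes are used instead of \s.

```bash
#!/usr/bin/env bash

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'
# Group 1: first word after newline + tab; group 2: its type; group 3: the rest.
regex=$'\n\t([^[:space:]]+)[[:blank:]]+([^[:space:]]+)(.*)'

rest=$f
disks=()
while [[ $rest =~ $regex ]]; do
    name=${BASH_REMATCH[1]}
    kind=${BASH_REMATCH[2]}
    rest=${BASH_REMATCH[3]}      # carry on with the part not yet consumed
    if [[ $kind == Disk ]]; then
        disks+=("$name")
    fi
done

echo "${disks[@]}"
```

Because each entry is consumed whole and the type is checked afterwards, a Printer entry cannot be mistaken for a Disk entry that appears later in the string.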
Intratelluric answered 29/8, 2022 at 7:55 Comment(0)

I was looking for a generic solution to the problem of matching/replacing the first and longest instance in the middle of a string, without relying on negation. Negation can add an unnecessary layer of complexity and won't always work due to the limitations of ERE. I wanted pattern y to match in (x)(y)(z) but have x match lazily. I found it can be achieved by using substrings in addition to the regex match.

The simplest case is where the x part of the pattern need not match anything in particular, like (.*?)(baz)(.*). Drop the x part of the expression then build that implied match from the target string:

text='Foo bar, baz qux. Wiz huz baz dux.'
re='(baz)(.*)'
if [[ "$text" =~ $re ]]; then
    before_end=$(( ${#text} - ${#BASH_REMATCH[0]} ))
    # obviously no need to put $text back into the result
    # there only to demo emulation of $BASH_REMATCH for (.*?)(baz)(.*)
    ungreedy_rematch=( "$text" "${text:0:before_end}" "${BASH_REMATCH[@]:1}" )
    # inspect
    (IFS='|'; echo "$IFS${ungreedy_rematch[*]}$IFS")
    # produces: |Foo bar, baz qux. Wiz huz baz dux.|Foo bar, |baz| qux. Wiz huz baz dux.|
    # replacement
    text="${ungreedy_rematch[1]}boz${ungreedy_rematch[3]}"
    echo "|$text|"
    # produces: |Foo bar, boz qux. Wiz huz baz dux.|
fi

Where the x part does need to match something, as in the asker's case, repeat the trick:

f=$'\n\tShare1   Disk\n\tShare2  Disk\n\tPrnt1  Printer'
the_rest="$f"
regex_before=$'\n\t(.*)'
regex_after=$'\\s+Disk(\n.*|$)' # desired match is implied before this one
while [[ "$the_rest" =~ $regex_before ]]; do
    # ignore this implied match
    the_rest="${BASH_REMATCH[1]}"
    if [[ "$the_rest" =~ $regex_after ]]; then
        # get this implied match
        before_end=$(( ${#the_rest} - ${#BASH_REMATCH[0]} ))
        match="${the_rest:0:before_end}"
        the_rest="${BASH_REMATCH[1]}"
        echo "New match: $match"
    else
        break
    fi
done

In each case, the pattern must consume the remainder of the entire string in order to calculate the offset of the implied match. Wrap up the logic in a shell function if you need more reusability.
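
One possible shape for such a function, as a minimal sketch: the function name lazy_before and the result variables LAZY_BEFORE, LAZY_MATCH and LAZY_AFTER are hypothetical, and the sketch assumes the pattern passed in contains no capture groups of its own.

```bash
#!/usr/bin/env bash

# lazy_before TEXT ERE
# Emulates (.*?)(ERE)(.*) by matching (ERE)(.*) and deriving the lazy prefix
# from the lengths. Sets LAZY_BEFORE, LAZY_MATCH and LAZY_AFTER on success.
lazy_before() {
    local text=$1 re="($2)(.*)"
    [[ $text =~ $re ]] || return 1
    # The match runs to the end of the string, so this is where it starts.
    local before_end=$(( ${#text} - ${#BASH_REMATCH[0]} ))
    LAZY_BEFORE=${text:0:before_end}
    LAZY_MATCH=${BASH_REMATCH[1]}
    LAZY_AFTER=${BASH_REMATCH[2]}
}

text='Foo bar, baz qux. Wiz huz baz dux.'
if lazy_before "$text" 'baz'; then
    echo "|$LAZY_BEFORE|$LAZY_MATCH|$LAZY_AFTER|"
    # prints: |Foo bar, |baz| qux. Wiz huz baz dux.|
fi
```

Returning the pieces through global variables mirrors how BASH_REMATCH itself works; a caller that needs several matches would call the function again on $LAZY_AFTER.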

Stature answered 22/12, 2023 at 4:44 Comment(0)
