Matching optional parameters with non-capturing groups in Bash regular expression
Asked Answered
E

2

7

I want to parse strings similar to the following into separate variables using regular expressions from within Bash:

Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";

or

Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";

The first part before "title" is common to all strings, the parts title and attributes are optional.

I managed to extract the mandatory parameters common to all strings, but I have trouble with optional parameters not necessarily present for all strings. As far as I found out, Bash doesn't support Non-capturing parentheses which I would use for this purpose.

Here is what I achieved thus far:

CATEGORY_REGEX='Category:\s*([^;]*);scheme="([^"]*)";class="([^"]*)";'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo ${BASH_REMATCH[0]}
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[3]}

The regular expression I would like to use (and which is working for me in Ruby) would be:

CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(?:title="([^"]*)";)?\s*(?:rel="([^"]*)";)?\s*(?:location="([^"]*)";)?\s*(?:attributes="([^"]*)";)?\s*(?:actions="([^"]*)";)?'

Is there any other solution to parse the string with command line tools without having to fall back on perl, python or ruby?

Elias answered 3/1, 2012 at 21:26 Comment(0)
C
10

I don't think non-capturing groups exist in bash regex, so your options are to use a scripting language or to remove the ?: from all of the (?:...) groups and just be careful about which groups you reference, for example:

CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(title="([^"]*)";)?\s*(rel="([^"]*)";)?\s*(location="([^"]*)";)?\s*(attributes="([^"]*)";)?\s*(actions="([^"]*)";)?'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo "full:       ${BASH_REMATCH[0]}"
echo "category:   ${BASH_REMATCH[1]}"
echo "scheme:     ${BASH_REMATCH[2]}"
echo "class:      ${BASH_REMATCH[3]}"
echo "title:      ${BASH_REMATCH[5]}"
echo "rel:        ${BASH_REMATCH[7]}"
echo "location:   ${BASH_REMATCH[9]}"
echo "attributes: ${BASH_REMATCH[11]}"
echo "actions:    ${BASH_REMATCH[13]}"

Note that starting with the optional parameters we need to skip a group each time, because the even numbered groups from 4 on contain the parameter name as well as the value (if the parameter is present).

Cabinetmaker answered 3/1, 2012 at 21:41 Comment(3)
That indeed is working. It's not the most elegant solution, but as long as there are no non-capturing groups in bash, the workaround with skipping a group each time is probably the best solution. One thing still bothers me: If there are spaces behind any semicolon, the regex fails even so there are "\s*" patterns behind them to match whitespaces.Elias
It seems, that special characters such as "\s*" are not working. Replacing it with just a space did work though: "\s*" => " *"Elias
Try using [[:space:]]* instead of \s.Swimming
G
1

You can emulate non-matching groups in bash using a little bit of regexp magic:

              _2__    _4__   _5__
[[ "fu@k" =~ ((.+)@|)((.+)/|)(.+) ]];
echo "${BASH_REMATCH[2]:--} ${BASH_REMATCH[4]:--} ${BASH_REMATCH[5]:--}"
# Output: fu - k

Characters @ and / are parts of string we parse. Regexp pipe | is used for either left or right (empty) part matching.

For curious, ${VAR:-<default value>} is variable expansion with default value in case $VAR is empty.

Gigantes answered 3/10, 2016 at 7:2 Comment(1)
This does not work for me. I simply get three dashes.Fokker

© 2022 - 2024 — McMap. All rights reserved.