Raku regex: How to know which group was captured at an alternation

With Perl (and almost any other regex flavour), every group is numbered sequentially.

So for example, this code:

'bar' =~ m/(foo)|(bar)/;

print $1 // 'x'; # (1-based index)
print $2 // 'x'; # (1-based index)

prints xbar

However, with Raku it behaves like there was a branch reset group wrapping the whole regex:

'bar' ~~ m/(foo)|(bar)/;

print $0 // 'x'; # (0-based index)
print $1 // 'x'; # (0-based index)

prints barx

I'm ok with this behaviour :). However, it is sometimes useful to know which group was captured under an alternation.

How can I find out which group matched in Raku?

Resistive answered 16/10, 2020 at 19:1 Comment(1)
The OP may already know this, but another difference between Perl5 and Raku is that Raku's | alternation operator does Longest Token Matching (LTM), not sequential (i.e. "first named") token matching. See: docs.raku.org/language/regexes#Longest_alternation:_| and docs.raku.org/language/… . – Begum

There are a few ways to do this, with varying degrees of utility.

One way would be to explicitly tell Raku what you want the numbers to be:

'bar' ~~ m/$1=(foo)|$2=(bar)/;

If you extend the regex, counting will continue at $3.
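As a minimal sketch of how the explicit numbering could then be used (the loop and the message text are illustrative assumptions, not part of the original answer), checking which numbered capture is defined tells you which branch matched:

```raku
for <foo bar> -> $input {
    if $input ~~ m/ $1=(foo) | $2=(bar) / {
        # Only the capture from the branch that matched is defined.
        my $group = $1.defined ?? 1 !! 2;
        say "$input: group $group";
    }
}
# foo: group 1
# bar: group 2
```

Note that with this numbering $0 is never set, so code written this way reads more like Perl's 1-based captures.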

A less-recommendable way to do this would be to sneak in an extra set of parentheses:

'bar' ~~ m/(foo)|()(bar)/;

If foo matches, it lands in $0 and $1 is undefined; if bar matches, it lands in $1 and $0 is empty (but not undefined). TIMTOWTDI, but this is not a good one ;-)

Another way could be to use a flag:

my $flag;
'bar' ~~ m/(foo {$flag = 'first'} ) | (bar {$flag = 'second'} )/;

The flag will be set based on the match. This can actually be a not-terrible way to do things, especially if your flag is binary and you will have some logic that you'll run over it.
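A small sketch of the flag in use (the output message is an assumption for illustration). One caveat worth keeping in mind: the code block runs as soon as the engine passes through it, so in a larger regex that backtracks, the flag could in principle be left set by a branch that later fails. Checking it only after a successful match, as here, keeps the simple case safe:

```raku
my $flag;
if 'bar' ~~ m/ (foo { $flag = 'first' }) | (bar { $flag = 'second' }) / {
    # $0 holds the text from whichever branch matched;
    # $flag records which branch that was.
    say "matched '$0' via the $flag branch";
}
# matched 'bar' via the second branch
```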

Another similar way would be to take advantage of the .make/.made that's normally used in action classes, but can still be used inline too:

'bar' ~~ m/(foo {make 'first'} ) | (bar {make 'second'} )/;
say $0.made; # 'second'

This one is nice if you have a lot of metadata you want to associate with it (but probably overkill for just knowing which one was chosen).

Rentsch answered 16/10, 2020 at 20:28 Comment(1)
Wow, more methods than expected. Certainly, TIMTOWTDI. Thank you! – Resistive

There are a few things that cause the capture index to reset. | and || are among them.

Putting it inside of another capture group is another. (Because the match result is a tree.)
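A short sketch of the tree structure (the input string is just an illustrative assumption): captures nested inside another capture group start again at index 0 and are reached through the outer capture.

```raku
if 'ab' ~~ / ( (a) (b) ) / {
    say ~$0;     # the outer group: ab
    say ~$0[0];  # first inner group, numbered from 0 again: a
    say ~$0[1];  # second inner group: b
}
```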


When Raku was being designed everything was redesigned to be more consistent, more useful, and more powerful. Regexes included.

If you have an alternation something like this:

/  (foo)  |  (bar)  /

You might want to use it like this:

$line ~~ /  (foo)  |  (bar)  /;
say %h{ ~$0 };

If the (bar) was $1 instead, you would have to write it something like this:

$line ~~ /  (foo)  |  (bar)  /;
say %h{ ~$0 || ~$1 };

It is generally more useful for the capture group numbering to start again from zero.

This also makes a regex more like a general purpose programming language. (Each “block” is an independent subexpression.)


Now sometimes it might be nice to renumber the capture groups.

/ ^
[   (..) '-'  (..) '-' (....)  # mm-dd-yyyy
|   (..) '-' (....)            # mm-yyyy
]
$ /

Notice that the yyyy part is either $2 or $1 depending on whether the dd part is included.

my $day   = $2 ?? +$1 !! 1;
my $month = +$0;
my $year  = +$2 || +$1;

We can renumber the yyyy to always be $2.

/ ^
[   (..) '-'  (..) '-' (....)  # mm-dd-yyyy
|   (..) '-' $2 = (....)       # mm-yyyy
]
$ /

my $day   = +$1 || 1;
my $month = +$0;
my $year  = +$2;
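To see the renumbering at work, here is a minimal runnable sketch applying the renumbered pattern above to one input of each shape. The sample dates and the output format are assumptions for illustration; the day is tested for truthiness before numifying so that a missing capture is never used as a number.

```raku
for <10-16-2020 10-2020> -> $date {
    if $date ~~ / ^
        [   (..) '-' (..) '-' (....)   # mm-dd-yyyy
        |   (..) '-' $2 = (....)       # mm-yyyy
        ]
    $ / {
        my $day   = $1 ?? +$1 !! 1;    # $1 is unset in the mm-yyyy branch
        my $month = +$0;
        my $year  = +$2;               # always $2, in either branch
        say "$date: year $year, month $month, day $day";
    }
}
# 10-16-2020: year 2020, month 10, day 16
# 10-2020: year 2020, month 10, day 1
```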

Or what if we also need to accept yyyy-mm-dd?

/ ^
[   (..) '-' (..) '-' (....)                # mm-dd-yyyy
|   (..) '-' $2 = (....)                    # mm-yyyy
|   $2 = (....) '-' $0 = (..) '-' $1 = (..) # yyyy-mm-dd
]
$ /

my $day   = +$1 || 1;
my $month = +$0;
my $year  = +$2;

Actually, now that we have a lot of capture groups, let's look again at how we would handle it if | didn't cause the numbered capture groups to start again from $0:

/ ^
[   (..) '-' (..) '-' (....) # mm-dd-yyyy
|   (..) '-' (....)          # mm-yyyy
|   (....) '-' (..) '-' (..) # yyyy-mm-dd
]
$ /

my $day   = +$1 || +$7 ||   1;
my $month = +$0 || +$3 || +$6;
my $year  = +$2 || +$4 || +$5;

That is not great.
For one thing you have to make sure both the regex and the my $day match up correctly.

Quick: without counting capture groups, check that those numbers match the correct capture groups.


Of course that still has the issue that concepts which have a name are instead captured by a number.

So we should use names instead.

/ ^
[   $<month> = (..) '-' $<day> = (..) '-' $<year> = (....) # mm-dd-yyyy
|   $<month> = (..) '-' $<year> = (....)                   # mm-yyyy
|   $<year> = (....) '-' $<month> = (..) '-' $<day> = (..) # yyyy-mm-dd
]
$ /

my $day   = +$<day> || 1;
my $month = +$<month>;
my $year  = +$<year>;
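Putting the named version together as a runnable sketch: parse-date is a hypothetical helper name, and the sample inputs and output format are assumptions for illustration. As in the pattern above, (..) accepts any two characters, so real code would probably want \d instead.

```raku
# parse-date is a hypothetical helper, not part of the original answer.
sub parse-date(Str $text) {
    my $m = $text ~~ / ^
        [   $<month> = (..) '-' $<day> = (..) '-' $<year> = (....)  # mm-dd-yyyy
        |   $<month> = (..) '-' $<year> = (....)                    # mm-yyyy
        |   $<year> = (....) '-' $<month> = (..) '-' $<day> = (..)  # yyyy-mm-dd
        ]
    $ /;
    return Nil unless $m;

    # A missing day defaults to 1, as in the snippets above.
    my $day = $m<day> ?? +$m<day> !! 1;
    return %( day => $day, month => +$m<month>, year => +$m<year> );
}

for <10-16-2020 10-2020 2020-10-16> -> $text {
    my %date = parse-date($text);
    say "$text: ", %date<year month day>.join('-');
}
# 10-16-2020: 2020-10-16
# 10-2020: 2020-10-1
# 2020-10-16: 2020-10-16
```

The calling code never needs to know which branch matched or how many groups precede a given capture.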

So long story short, I would do this:

/ $<foo> = (foo)  |  $<bar> = (bar) /;


if $<foo> {
    …
} elsif $<bar> {
    …
}
Airlike answered 17/10, 2020 at 19:57 Comment(1)
Great answer. I had thought about including the why, but I'm glad I didn't as you did a much better job explaining it. Definitely agree about using named captures too. I think I've used numerical captures only once or twice in production code, otherwise always named. – Rentsch
