Raku regex: Inconsistent longest token matching
Asked Answered
R

1

14

Raku's regexes are expected to match longest token.

And in fact, this behaviour is seen in this code:

raku -e "'AA' ~~ m/A {say 1}|AA {say 2}/"
# 2

However, when the text is in a variable, it does not seem to work in the same way:

raku -e "my $a = 'A'; my $b = 'AA'; 'AA' ~~ m/$a {say 1}|$b {say 2}/"
# 1

Why they work in a different way? Is there a way to use variables and still match the longest token?

Radioman answered 17/10, 2020 at 21:43 Comment(0)
B
14

There are two things at work here.

The first is the meaning of "longest token". When there is an alternation (using | or implied by use of proto regexes), the declarative prefix of each branch is extracted. Declarative means the subset of the Raku regex language that can be matched by a finite state machine. The declarative prefix is determined by taking regex elements until a non-declarative element is encountered. You can read more and find some further references in the docs.

To understand why things are this way, a small detour may be helpful. A common approach to building parsers is to write a tokenizer, which breaks the input text up into a sequence of "tokens", and then a parser that identifies larger (and perhaps recursive) structure from those tokens. Tokenizing is typically performed using a finite state machine, since it is able to rapidly cut down the search space. With Raku grammars, we don't write the tokenizer ourselves; instead, it's automatically extracted from the grammar for us (more precisely, a tokenizer is calculated per alternation point).

Secondly, Raku regexes are a nested language within the main Raku language, parsed in a single pass with it and compiled at the same time. (This is a departure from most languages, where regexes are provided as a library that we pass strings to.) The longest token calculation takes place at compile time. However, variables are interpolated at runtime. Therefore, a variable interpolation in a regex is non-declarative, and therefore is not considered as part of the longest token matching.

Baghdad answered 17/10, 2020 at 22:23 Comment(4)
Also, the nature of Raku's expressiveness means that it's fairly easy to define a regex token longest (or shortest) that takes full regexen as arguments. (That said, it won't necessarily be as optimized since at least the naïve method would require following each match in full), but the fact that you can is pretty cool). Also, brb while I go make that module.Antalkali
So Raku does 'textual order' matching instead of 'longest token' matching where variables are involved on the RHS? A quick test with the four regexes /$a {say 1}|$b {say 2}/ , /$a {say 1}||$b {say 2}/ , /$b {say 2}||$a {say 1}/ , and /$b {say 2}|$a {say 1}/ suggests this is true.Bedclothes
@jubilatious1. I think the key here is what Jonathan said: The declarative prefix is determined by taking regex elements **until** a non-declarative element is encountered So for example this two matches are different because of the placement of the variables: 'AAAA' ~~ m/$a A {say 1}|$b AA {say 2}/ says 1 (no tokens are processed, so it behaves as textual order). And 'AAAA' ~~ m/A $a {say 1}|AA $b {say 2}/ says 2 because A and AA tokens were processed so it chooses second alteration, which is longerRadioman
Source order is used as a tie-breaker if the longest declarative prefixes are of equal (potentially zero) length and if the longest literal tie-breaker can't resolve the matter either (which it cannot on a zero-length prefix).Baghdad

© 2022 - 2024 — McMap. All rights reserved.