Using 'after' as lookbehind in a grammar in Raku

3

8

I'm trying to do a match in a raku grammar and failing with 'after'. I've boiled down my problem to the following snippet:

grammar MyGrammar {

    token TOP {
        <character>
    }

    token character {
        <?after \n\n>LUKE
    }
}

say MyGrammar.subparse("\n\nLUKE");

This returns #<failed match> from MyGrammar.subparse and Nil from MyGrammar.parse.

But if I run a match in the REPL:

"\n\nLUKE" ~~ /<?after \n\n>LUKE/

I get the match 「LUKE」

So there's something I'm not understanding, and I'm not sure what. Any pointers?

Fructiferous answered 1/7, 2020 at 22:10 Comment(0)
10

When we parse a string using a grammar, the matching is anchored to the start of the string. Parsing the input with parse requires us to consume all of the string. There is also a subparse, which allows us to not consume all of the input, but this is still anchored to the start of the string.

By contrast, a regex like /<?after \n\n>LUKE/ will scan through the string, trying to match the pattern at each position in the string, until it finds a position at which it matches (or gets to the end of the string and gives up). This is why it works. Note, however, that if your goal is to not capture the \n\n, then you could instead have written the regex as /\n\n <( LUKE/, where <( indicates where to start capturing. At least on the current Rakudo compiler implementation, this way is more efficient.
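
As a rough sketch of the two standalone regexes just described (the variable names are only for illustration), both report the same match, starting at offset 2. The difference is whether the \n\n is consumed on the way there:

```raku
my $text = "\n\nLUKE";

# Lookbehind: asserts that "\n\n" sits just before the cursor, consumes nothing.
my $lookbehind = $text ~~ / <?after \n\n> LUKE /;

# Capture marker: consumes "\n\n", then <( moves the start of the reported match.
my $marker = $text ~~ / \n\n <( LUKE /;

say $lookbehind.Str;    # LUKE
say $lookbehind.from;   # 2
say $marker.Str;        # LUKE
say $marker.from;       # 2
```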

It's not easy to suggest how to write the grammar without a little more context (I'm guessing this is extracted from a larger problem). You could, for example, consume whitespace at the start of the grammar:

grammar MyGrammar {

    token TOP {
        \s+ <character>
    }

    token character {
        <?after \n\n>LUKE
    }
}

say MyGrammar.subparse("\n\nLUKE");

Or consume the \n\n in character but exclude it from the match with <(, as mentioned earlier.

Tab answered 1/7, 2020 at 23:29 Comment(4)
Accepted as I have now used <( to match the pattern but not capture it, which does what my intention was with after - Fructiferous
@Fructiferous "I have used <( to match the pattern but not capture it". Oh! You mean the pattern to the left. After all, <( does look like it points left! (The current <(...)> doc reflects the symbolism originally intended. So it says <( "indicates the start of the match's overall capture". That is to say, the ( marks the left of what is captured (cf ( in (...)) and the < marks the left of the assertion (cf < in <...>). And ... <(LUKE)> ... would capture just LUKE.) - Existent
@ralph Yes, to match \n\n, but not capture it, and allow the match to continue. I had successfully used a <before ...> at the end of my token (not in this example), which was why I was being frustrated by <after ...> not doing what I expected. I think in my full token, I will now probably try <(...)> - Fructiferous
Oh, so the regex scans in 2 chars (\n\n) and then matches, but the grammar never scans so never matches... - Molybdous
6

<?after ...> does not advance the match cursor

Of crucial import here is that <?after \n\n> is a "zero width" assertion.

It matches if the match cursor is sitting to the immediate right of "\n\n" in the string being matched, but it doesn't advance the match cursor.
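
To see the "zero width" part in isolation, here is a small sketch that matches the assertion on its own (the variable name is just for illustration). The match succeeds at offset 2, yet it covers no characters:

```raku
# Only the lookbehind, nothing else in the pattern.
my $assertion = "\n\nLUKE" ~~ / <?after \n\n> /;

say $assertion.from;    # 2  (first position that has "\n\n" to its left)
say $assertion.to;      # 2  (the cursor did not move)
say $assertion.chars;   # 0  (an empty, but successful, match)
```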

Why the ~~ / ... / version matches

The regex/grammar engine is automatically advancing the match cursor for you.

A plain regex-style match works like traditional regexes. In particular, it is supposed to match anywhere in the string being matched, unless you explicitly add anchors such as ^ (start of string) and/or $ (end of string).

More explicitly, the match engine will start by trying to match at the first character position of a string being matched. Then, if that fails, it'll automatically move forward one character in the string, and then try again to match from the start of the regex pattern.

So all of these will also match and give the same result:

"\n\nLUKE" ~~ /LUKE/;                     # 「LUKE」
"\n\nLUKE" ~~ /LUKE $/;                   # 「LUKE」
"LUKE"     ~~ /^ LUKE $/;                 # 「LUKE」
"\n\nLUKE" ~~ / <?after \n\n>LUKE $/;     # 「LUKE」

Why the grammar version doesn't match

A grammar is expected to match starting at the start of the input string. Otherwise it fails.

More explicitly, .parse has implicit ^ and $ anchors at the start and end of a parse, and .subparse has an implicit ^ at the start.

If the match cursor fails to progress past the first character then the parse fails. Your grammar doesn't progress the match cursor past the first character, so it fails.

(The <?after \n\n> not only would fail to advance the cursor if it matched, it never even matches in the first place -- because at the start of the string the match cursor is only after nothing. If you had written <?after ''> instead, then that would always succeed, but would still not advance the cursor, so the grammar would still fail if that's the only change you made.)
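
A minimal sketch of the anchoring behaviour described above. The grammar names LookbehindFirst and ConsumeFirst are made up for this example; the first one mirrors the grammar from the question:

```raku
my $text = "\n\nLUKE";

# An explicit ^ makes a plain regex behave like the grammar: at position 0
# nothing precedes the cursor, so the lookbehind cannot succeed.
my $anchored-lookbehind = so $text ~~ / ^ <?after \n\n> LUKE /;
my $anchored-consume    = so $text ~~ / ^ \n\n LUKE /;

grammar LookbehindFirst { token TOP { <?after \n\n> LUKE } }
grammar ConsumeFirst    { token TOP { \n\n LUKE } }

my $lookbehind-subparse = so LookbehindFirst.subparse($text);
# .subparse only needs to match at the start; .parse must also reach the end.
my $partial-subparse    = so ConsumeFirst.subparse("\n\nLUKE and more");
my $partial-parse       = so ConsumeFirst.parse("\n\nLUKE and more");

say $anchored-lookbehind;   # False
say $anchored-consume;      # True
say $lookbehind-subparse;   # False
say $partial-subparse;      # True
say $partial-parse;         # False
```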

Existent answered 1/7, 2020 at 23:27 Comment(2)
Thank you for the explanation. I think the crux may be "...and does not advance the match cursor" which I'm assuming I sort of thought it did even if it wasn't capturing. - Fructiferous
I've updated the answer to highlight the crux, and increase the distinction between this answer and the others, to make it more valuable to later readers. Thanks for commenting! - Existent
4

The current answers are excellent, but let me be a bit more verbose in explaining the origin of the misunderstanding. The main point is that here you're comparing a token that is part of a grammar with a standalone regex. They use the same language, regular expressions, but they are not the same. You can use a regex to match, substitute and extract information; the objective of a token is purely extracting information: from a string with a regular structure, I want a part and just that part. I assume you're interested in the LUKE part, and that you are using <?after ...> to express "No, this is not what I'm interested in", or "Skip this, get me only the goods". Jonathan has already mentioned one way, probably the best, to do so:

grammar MyGrammar {

    token TOP {
        <character>
    }

    token character {
         \n \n <( LUKE
    }
}

say MyGrammar.subparse("\n\nLUKE");

This will not only match, but also capture only LUKE in character:

「

LUKE」
 character => 「LUKE」

The leading \n\n is skipped over. However, grammars are meant to extract, not merely to match. So you probably want the separators to be part of the grammar as a token of their own, since it is not worth repeating them over and over in every token. Besides, grammars are generally intended to be used top-down. So this will do:

grammar MyGrammar {

    token TOP {
        <separator><character>
    }

    token separator { \n \n }
    token character { <[A..Z]>+  }
}

say MyGrammar.parse("\n\nLUKE");

The character token is now more general (although maybe it could allow some whitespace, I don't know). Again, maybe you're not interested in capturing the separator. In that case, just put a dot in front of its name to leave it out of the captures. Not being interested in it does not mean you don't have to parse it, and grammars give you a way of doing that:

grammar MyGrammar {

    token TOP {
        <.separator><character>
    }

    token separator { \n \n }
    token character { <[A..Z]>+  }
}

say MyGrammar.parse("\n\nLUKE");

This one gives the same result:

「

LUKE」
 character => 「LUKE」
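
As a small sketch of what the dot changes, the two variants can be compared by listing their named captures. The grammar names Capturing and NonCapturing are made up for this comparison; the tokens are the ones shown above:

```raku
grammar Capturing {
    token TOP       { <separator><character> }
    token separator { \n \n }
    token character { <[A..Z]>+ }
}

grammar NonCapturing {
    token TOP       { <.separator><character> }   # leading dot: match, don't capture
    token separator { \n \n }
    token character { <[A..Z]>+ }
}

my $with    = Capturing.parse("\n\nLUKE");
my $without = NonCapturing.parse("\n\nLUKE");

say $with.hash.keys.sort.join(',');      # character,separator
say $without.hash.keys.sort.join(',');   # character
say ~$without<character>;                # LUKE
```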

At the end of the day, grammars and regexes have different use cases, and thus different solutions for the same objective. Thinking about them in the proper way gives you a hint on how to structure them.

Zaibatsu answered 2/7, 2020 at 7:29 Comment(5)
"comparing a token with a regex." The only difference between the effect of a Raku token and regex construct is that by default the latter backtracks as much as needed to match, like a traditional regex, whereas a token doesn't (it ratchets instead). This SO isn't about ratcheting. There's also a difference between token and regex, and corresponding unavoidable ambiguity in Raku design doc about what "token" and "regex" mean. - Existent
Thank you for the explanation. I think it is good advice to break down into more parts, which I'll look at. - Fructiferous
@Existent that's correct in theory, but in practice tokens are part of a grammar and as such their use case is totally different. I'll clarify anyway - Zaibatsu
@Zaibatsu Sure. I didn't mean for you to edit your answer (if I did I would have addressed it to you), but imo your edit is an improvement. "in practice tokens are part of a grammar and as such their use case is totally different." Yes, "tokens" are. But not tokens. I commented because I thought there'd be a good chance readers who don't know Raku and/or parsing well might think a token is what you meant by "token". - Existent
... which is why I did it and I appreciate any comment you do. - Zaibatsu
