This answer explains the problem, provides a simple solution, and then goes deep.
## The problem with your grammar
First, your SO question demonstrates what seems to be either an extraordinary bug or a common misunderstanding. See JJ's answer for the issue he's filed to follow up on it, and/or my footnote.⁴
Putting the bug/"bug" aside, your grammar directs Raku to *not* match your input:

- The `[<subdomain> '.']*` atom eagerly consumes the string `'baz.example.'` from your input;
- The remaining input (`'com'`) fails to match the remaining atoms (`<domain> '.' <tld>`);
- The `:ratchet` that's in effect for `token`s means the grammar engine does not backtrack into the `[<subdomain> '.']*` atom.

Thus the overall match fails.
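For concreteness, here's a self-contained sketch of the failure. The rule bodies and the input address are my reconstruction of the question's grammar, not a verbatim copy:

```raku
grammar Email {
    token TOP       { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }
    token name      { \w+ ['.' \w+]* }
    token subdomain { \w+ }
    token domain    { \w+ }
    token tld       { \w+ }
}

# The `*` atom eats 'baz.example.', so <domain> '.' <tld> can't match 'com':
say Email.parse('[email protected]');  # Nil
```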
## The simplest solution
The simplest solution to make your grammar work is to append `!` to the `[<subdomain> '.']*` pattern in your `token`.

This has the following effect:

- If any of the remainder of the `token` fails (after the subdomain atom), the grammar engine backtracks to the subdomain atom, drops the last of its match repetitions, and then tries to move forward again;
- If matching fails again, the engine again backtracks to the subdomain atom, drops another repetition, and tries again;
- The grammar engine repeats the above actions until either the rest of the `token` matches or there are no matches of the `[<subdomain> '.']` atom left to backtrack on.

Note that adding the `!` to the subdomain atom means the backtracking behaviour is limited to just that atom; if the domain atom matched but the tld atom then failed, the token would fail rather than try to backtrack. This is because the whole point of `token`s is that, by default, they don't backtrack to earlier atoms after those have succeeded.
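Putting that together, here's a sketch of the fixed grammar (again with rule bodies and input reconstructed from the question rather than copied verbatim):

```raku
grammar Email {
    # The only change is `*!`, which lets this one atom give back repetitions:
    token TOP       { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
    token name      { \w+ ['.' \w+]* }
    token subdomain { \w+ }
    token domain    { \w+ }
    token tld       { \w+ }
}

say so Email.parse('[email protected]');  # True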
## Playing with Raku, developing grammars, and debugging
`Nil` is fine as a response from a grammar that is known (or thought) to work fine, when you don't want a more useful response in the event of a parse failure.

For any other scenario there are much better options, as summarized in my answer to How can error reporting in grammars be improved?.

In particular, for playing around with, developing, or debugging a grammar, the best option by far is to install the free Comma and use its Grammar Live View feature.
## Fixing your grammar; general strategies
Your grammar suggests ~~two~~ three options¹:

1. Parse forwards with some backtracking. (The simplest solution.)
2. Parse backwards. Write the pattern in reverse, and reverse the input and output.
3. Post parse the parse.
## Parse forwards with some backtracking
Backtracking is a reasonable approach for parsing some patterns. But it is best minimized, to maximize performance, and even then it still carries DoS risks if written carelessly.²
To switch on backtracking for an entire token, just switch the declarator to `regex` instead. A `regex` is just like a token, except that it enables backtracking like a traditional regex.

Another option is to stick with `token` and limit the part of the pattern that may backtrack. One way to do that is to append a `!` after an atom to let it backtrack, explicitly overriding the `token`'s overall "ratchet" that would otherwise kick in when that atom succeeds and matching moves on to the next atom:
```raku
token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
#                                        🡅
```
An alternative to `!` is to insert `:!ratchet` to switch "ratcheting" off for part of a rule, and then `:ratchet` to switch it back on again, e.g.:

```raku
token TOP { <name> '@' :!ratchet [<subdomain> '.']* :ratchet <domain> '.' <tld> }
```

(You can also use `r` as an abbreviation for `ratchet`, i.e. `:!r` and `:r`.)
## Parse backwards
A classic parsing trick that works in some scenarios is to parse backwards, as a way to avoid backtracking.
```raku
grammar Email {
  token TOP       { <tld> '.' <domain> ['.' <subdomain> ]* '@' <name> }
  token name      { \w+ ['.' \w+]* }
  token domain    { \w+ }
  token subdomain { \w+ }
  token tld       { \w+ }
}

say Email.parse(flip '[email protected]').hash>>.flip;
# {domain => example, name => foo.bar, subdomain => [baz], tld => com}
```
Probably too complicated for most folks' needs but I thought I'd include it in my answer.
## Post parse the parse
In the above I presented a solution that introduces some backtracking, and another that avoids it but with significant costs in terms of ugliness, cognitive load, etc. (parsing backwards?!?).

There's another very important technique that I had overlooked until reminded by JJ's answer.¹ Just parse the results of the parse.
Here's one way. I've completely restructured the grammar, partly to make more sense of this way of doing things, and partly to demonstrate some Raku grammar features:
```raku
grammar Email {
  token TOP {
    <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>
  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}

say Email.parse('[email protected]')<host><parts>;
```

displays:

```
[「baz」 「buz」 「example」 「com」]
```
While this grammar matches the same strings as yours, and post-parses like JJ's, it's obviously very different:
- The grammar is reduced to three tokens.
- The `TOP` token makes two calls to a generic `dotted-parts` token, with an argument specifying the minimum number of parts.
- `$<host> = ...` captures the following atom under the name `<host>`.

  (This is typically redundant if the atom is itself a named pattern, as it is in this case -- `<dotted-parts>`. But "dotted-parts" is rather generic; and to refer to the second match of it (the first comes before the `@`), we'd need to write `<dotted-parts>[1]`. So I've tidied up by naming it `<host>`.)
- The `dotted-parts` pattern may look a bit challenging but it's actually pretty simple:
  - It uses a quantifier clause (`** {min..max}`) to express any number of parts provided there's at least the minimum.
  - It uses a modifier clause (`% <separator>`) which says there must be a `.` between each part.
- `<host><parts>` extracts, from the parse tree, the captured data associated with the `parts` token of the second use of `dotted-parts` in the `TOP` rule. Which is an array: `[「baz」 「buz」 「example」 「com」]`.
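As a standalone illustration of the quantifier and separator clauses (a toy example of my own, using a literal range rather than the grammar's `{min..*}` block), requiring at least two dot-separated word parts looks like this:

```raku
# `** 2..*` means two or more repetitions; `% '.'` requires a
# dot between each repetition.
say 'a.b.c' ~~ / [\w+] ** 2..* % '.' /;  # 「a.b.c」
say 'a'     ~~ / [\w+] ** 2..* % '.' /;  # Nil (only one part)
```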
Sometimes one wants some or all of the reparsing to happen during the parsing, so that reparsed results are ready when a call to `.parse` completes.
JJ has shown one way to code what are called actions. This involved:

- Creating an "actions" class containing methods whose names correspond to named rules in the grammar;
- Telling the parse method to use that actions class;
- If a rule succeeds, then the action method with the corresponding name is called (while the rule remains on the call stack);
- The match object corresponding to the rule is passed to the action method;
- The action method can do whatever it likes, including reparsing what just got matched.
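To make the steps above concrete, here's a minimal sketch of the actions-class approach. The grammar, the `EmailActions` class, its method body, and the input are my own hypothetical illustration, not JJ's actual code:

```raku
grammar Email {
    token TOP  { <name> '@' <host> }
    token name { \w+ ['.' \w+]* }
    token host { \w+ ['.' \w+]* }
}

class EmailActions {
    # Called automatically when `TOP` succeeds; $/ is TOP's match object.
    method TOP ($/) {
        make %( name => ~$<name>, host-parts => $<host>.split('.').list );
    }
}

say Email.parse('[email protected]', :actions(EmailActions.new)).made;
```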
It's simpler and sometimes better to write actions directly inline:
```raku
grammar Email {
  token TOP {
    <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>

    # The new bit:
    {
      make (subs => .[ 0 .. *-3 ],
            dom  => .[ *-2    ],
            tld  => .[ *-1    ])
      given $<host><parts>
    }
  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}

.say for Email.parse('[email protected]').made;
```
displays:

```
subs => (「baz」 「buz」)
dom => 「example」
tld => 「com」
```
Notes:
- I've directly inlined the code doing the reparsing.

  (One can insert arbitrary code blocks (`{...}`) anywhere one could otherwise insert an atom. In the days before we had grammar debuggers, a classic use case was `{ say $/ }`, which prints `$/`, the match object, as it stands at the point the code block appears.)

- If a code block is put at the end of a rule, as I have done, it is almost equivalent to an action method.

  (It will be called when the rule has otherwise completed, and `$/` is already fully populated. In some scenarios inlining an anonymous action block is the way to go. In others, breaking it out into a named method in an actions class, as JJ did, is better.)

- `make` is a major use case for action code.

  (All `make` does is store its argument in the `.made` attribute of `$/`, which in this context is the current parse tree node. Results stored by `make` are automatically thrown away if backtracking subsequently discards the enclosing parse node. Often that's precisely what one wants.)

- `foo => bar` forms a `Pair`.

- The postcircumfix `[...]` operator indexes its invocant:

  - In this case there's just a prefix `.` without an explicit LHS, so the invocant is "it". The "it" was set up by the `given`; i.e. it's (excuse the pun) `$<host><parts>`.
  - The `*` in the index `*-n` is the invocant's length; so `[ 0 .. *-3 ]` is all but the last two elements of `$<host><parts>`.

- The `.say for ...` line ends in `.made`³, to pick up the `make`'d value.

- The `make`'d value is a list of three pairs breaking out `$<host><parts>`.
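The `given` plus `*-n` indexing combination can be sketched on its own, outside any grammar (toy data of my own, standing in for `$<host><parts>`):

```raku
my @parts = <baz buz example com>;

given @parts {
    say .[ 0 .. *-3 ];  # (baz buz) -- all but the last two elements
    say .[ *-2 ];       # example
    say .[ *-1 ];       # com
}
```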
## Footnotes
¹ I had truly thought my first two options were the two main ones available. It's been around 30 years since I encountered Tim Toady online. You'd think that by now I'd have learned his eponymous aphorism by heart: There Is More Than One Way To Do It!
² Beware "pathological backtracking". In a production context, if you have suitable control of your input, or of the system your program runs on, you may not have to worry about deliberate or accidental DoS attacks, because they either can't happen, or will uselessly take down a system that can be rebooted in the event of being rendered unavailable. But if you do need to worry, i.e. the parsing is running on a box that needs to be protected from DoS attacks, then an assessment of the threat is prudent. (Read Details of the Cloudflare outage on July 2, 2019 to get a real sense of what can go wrong.) If you are running Raku parsing code in such a demanding production environment, then you would want to start an audit of the code by searching for patterns that use `regex`, `/.../` (the `...` are metasyntax), `:!r` (to include `:!ratchet`), or `*!`.
³ There's an alias for `.made`; it's `.ast`. I think it stands for A Sparse Tree or Annotated Subset Tree, and there's a cs.stackexchange.com question that agrees with me.
⁴ Golfing your problem, this seems wrong:

```raku
say 'a' ~~ rule { .* a }  # 「a」
```
More generally, I thought the only difference between a `token` and a `rule` was that the latter injects a `<.ws>` at each significant space. But that would mean this should work:

```raku
token TOP { <name> <.ws> '@' <.ws> [<subdomain> <.ws> '.']* <.ws>
            <domain> <.ws> '.' <.ws> <tld> <.ws>
          }
```

But it doesn't!
At first this freaked me out. Writing this footnote two months later, I'm feeling somewhat less freaked out.
Part of this is my speculation about why I've not been able to find anyone reporting this in the 15 years since the first Raku grammar prototype became available via Pugs. That speculation includes the possibility that @Larry deliberately designed them to work as they do, and that calling it a "bug" primarily reflects misunderstanding among the current crop of mere mortals like us, as we try to explain why Raku does what it does based on our analysis of our sources -- roast, the original design documents, the compiler source code, etc.
In addition, given that the current "buggy" behaviour seems ideal and intuitive (except for contradicting the doc), I'm focusing on interpreting my feeling of great discomfort -- during this interim period of unknown length in which I don't understand why it gets it right -- as a positive experience. I hope others can too -- or, much better, figure out what is actually going on and let us know!