What's the real difference between a token and a rule?
I was drawn to Raku because of its built-in grammars and figured I'd play around with them and write a simple email address parser. Only problem: I couldn't get it to work.

I tried countless iterations before landing on something that actually works, and I'm struggling to understand why.

All it boiled down to was changing token to rule.

Here's my example code:

grammar Email {
  token TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');

doesn't work; it simply prints Nil, but

grammar Email {
  rule TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');

does work and correctly prints

「foo.bar@baz.example.com」
 name => 「foo.bar」
 subdomain => 「baz」
 domain => 「example」
 tld => 「com」

And all I changed was token TOP to rule TOP.

From what I can gather from the documentation, the only difference between those two keywords is that whitespace is significant in rule but not in token. If that's true, the first example should work, as I want to ignore the whitespace between the individual pieces of the pattern.

Removing the spaces between the pieces

rule TOP { <name>'@'[<subdomain>'.']*<domain>'.'<tld> }

reverts the behaviour back to printing Nil.

Anyone able to clue me in on what's going on here?

EDIT: Changing the TOP rule to a regex instead, which allows for backtracking makes it work too.

The question still remains, how come rule { } (which is the same as regex {:ratchet :sigspace }) matches when token { } (which is the same as regex {:ratchet }) doesn't?

The email address doesn't have any spaces in it, so for all intents and purposes it should fail right away.

Pierpont answered 27/5, 2020 at 19:53 Comment(2)
@raiph I added [context-free-grammar] because I originally didn't know any better. A few hours later I had read more up on it all, and you're right, it is closer to unrestricted grammarPierpont
Friedl in "Mastering Regular Expressions" (O'Reilly, 1997; page 294 et seq) develops a Perl regex to match email addresses, 4724 (!) bytes long. But even that one falls short on some corner cases...Tessellated

This answer explains the problem, provides a simple solution, and then goes deep.

The problem with your grammar

First, your SO question demonstrates what seems to be either an extraordinary bug or a common misunderstanding. See JJ's answer for the issue he's filed to follow up on it, and/or my footnote.[4]

Putting the bug/"bug" aside, your grammar directs Raku to not match your input:

  • The [<subdomain> '.']* atom eagerly consumes the string 'baz.example.' from your input;

  • The remaining input ('com') fails to match the remaining atoms (<domain> '.' <tld>);

  • The :ratchet that's in effect for tokens means the grammar engine does not backtrack into the [<subdomain> '.']* atom.

Thus the overall match fails.

The simplest solution

The simplest solution to make your grammar work is to append ! to the [<subdomain> '.']* pattern in your token.

This has the following effect:

  • If any of the remainder of the token fails (after the subdomain atom), the grammar engine will backtrack to the subdomain atom, drop the last of its match repetitions, and then try to move forward again;

  • If matching fails again the engine will again backtrack to the subdomain atom, drop another repetition, and try again;

  • The grammar engine will repeat the above actions until either the rest of the token matches or there are no matches of the [<subdomain> '.'] atom left to backtrack on.

Note that adding the ! to the subdomain atom means that the backtracking behaviour is limited to just the subdomain atom; if the domain atom matches but the tld atom then does not, the token fails instead of trying to backtrack further. This is because the whole point of tokens is that, by default, they don't backtrack to earlier atoms after they've succeeded.
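Putting that together: this is the grammar from the question with just the ! appended (the input is the one implied by the question's expected output):

```raku
grammar Email {
  # *! allows backtracking on this one quantifier, overriding the token's ratchet
  token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');
```

Now the engine may give back subdomain repetitions when <domain> '.' <tld> fails to match, so the parse succeeds.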

Playing with Raku, developing grammars, and debugging

Nil is fine as a response from a grammar that is known (or thought) to work fine, if you don't need a more useful response in the event of a parse failure.

For any other scenario there are much better options, as summarized in my answer to How can error reporting in grammars be improved?.

In particular, for playing around, or developing a grammar, or debugging one, the best option by far is to install the free Comma and use its Grammar Live View feature.

Fixing your grammar; general strategies

Your grammar suggests three options1:

  • Parse forwards with some backtracking. (The simplest solution.)

  • Parse backwards. Write the pattern in reverse, and reverse the input and output.

  • Post parse the parse.

Parse forwards with some backtracking

Backtracking is a reasonable approach for parsing some patterns. But it is best minimized, to maximize performance, and even then still carries DoS risks if written carelessly.2


To switch on backtracking for an entire token, just switch the declarator to regex instead. A regex is just like a token but specifically enables backtracking like a traditional regex.
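For example, changing only the declarator of the question's TOP (as the question's EDIT also notes) makes the original pattern match:

```raku
grammar Email {
  # regex = token + backtracking; the subdomain loop can now give back repetitions
  regex TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');
```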

Another option is to stick with token and limit the part of the pattern that might backtrack. One way to do that is to append a ! after an atom to let it backtrack, explicitly overriding the token's overall "ratchet" that would otherwise kick in when that atom succeeds and matching moves on to the next atom:

token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
                                         🡅

An alternative to ! is to insert :!ratchet to switch "ratcheting" off for a part of a rule, and then :ratchet to switch ratcheting back on again, eg:

token TOP { <name> '@' :!ratchet [<subdomain> '.']* :ratchet <domain> '.' <tld> }  

(You can also use r as an abbreviation for ratchet, i.e. :!r and :r.)

Parse backwards

A classic parsing trick that works for some scenarios is to parse backwards as a way to avoid backtracking.

grammar Email {
  token TOP { <tld> '.' <domain> ['.' <subdomain> ]* '@' <name> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse(flip 'foo.bar@baz.example.com').hash>>.flip;
#{domain => example, name => foo.bar, subdomain => [baz], tld => com}

Probably too complicated for most folks' needs but I thought I'd include it in my answer.

Post parse the parse

In the above I presented a solution that introduces some backtracking, and another that avoids it but with significant costs in terms of ugliness, cognitive load etc. (parsing backwards?!?).

There's another very important technique that I overlooked until reminded by JJ's answer.1 Just parse the results of the parse.


Here's one way. I've completely restructured the grammar, partly to make more sense of this way of doing things, and partly to demonstrate some Raku grammar features:

grammar Email {
  token TOP {
              <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>
  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}
say Email.parse('foo.bar@baz.buz.example.com')<host><parts>

displays:

[「baz」 「buz」 「example」 「com」]

While this grammar matches the same strings as yours, and post-parses like JJ's, it's obviously very different:

  • The grammar is reduced to three tokens.

  • The TOP token makes two calls to a generic dotted-parts token, with an argument specifying the minimum number of parts.

  • $<host> = ... captures the following atom under the name <host>.

    (This is typically redundant if the atom is itself a named pattern, as it is in this case -- <dotted-parts>. But "dotted-parts" is rather generic; and to refer to the second match of it (the first comes before the @), we'd need to write <dotted-parts>[1]. So I've tidied up by naming it <host>.)

  • The dotted-parts pattern may look a bit challenging but it's actually pretty simple:

    • It uses a quantifier clause (** {min..*}) to express any number of parts provided there's at least the given minimum.

    • It uses a modifier clause (% <separator>) which says there must be a dot between each part.

  • <host><parts> extracts from the parse tree the captured data associated with the parts token from the second use of dotted-parts in the TOP rule. Which is an array: [「baz」 「buz」 「example」 「com」].
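As an aside, passing arguments to a token, as <dotted-parts(2)> does, works like an ordinary signature. A minimal standalone sketch (the Csv grammar name and its rules are hypothetical, just for illustration):

```raku
grammar Csv {
  token TOP { <field-list(2)> }                       # demand at least two fields
  token field-list(\min) { [\w+] ** {min..*} % ',' }  # min is bound from the call site
}
say so Csv.parse('a,b,c');  # True
say so Csv.parse('a');      # False — only one field
```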


Sometimes one wants some or all of the reparsing to happen during the parsing, so that reparsed results are ready when a call to .parse completes.

JJ has shown one way to code what are called actions. This involved:

  • Creating an "actions" class containing methods whose names corresponded to named rules in the grammar;

  • Telling the parse method to use that actions class;

  • If a rule succeeds, then the action method with the corresponding name is called (while the rule remains on the call stack);

  • The match object corresponding to the rule is passed to the action method;

  • The action method can do whatever it likes, including reparsing what just got matched.

It's simpler and sometimes better to write actions directly inline:

grammar Email {
  token TOP {
              <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>

    # The new bit:
    {
      make (subs => .[ 0 .. *-3 ],
            dom  => .[      *-2 ],
            tld  => .[      *-1 ])

      given $<host><parts>
    }

  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}
.say for Email.parse('foo.bar@baz.buz.example.com').made;

displays:

subs => (「baz」 「buz」)
dom => 「example」
tld => 「com」

Notes:

  • I've directly inlined the code doing the reparsing.

    (One can insert arbitrary code blocks ({...}) anywhere one could otherwise insert an atom. In the days before we had grammar debuggers a classic use case was { say $/ } which prints $/, the match object, as it is at the point the code block appears.)

  • If a code block is put at the end of a rule, as I have done, it is almost equivalent to an action method.

    (It will be called when the rule has otherwise completed, and $/ is already fully populated. In some scenarios inlining an anonymous action block is the way to go. In others, breaking it out into a named method in an action class like JJ did is better.)

  • make is a major use case for action code.

    (All make does is store its argument in the .made attribute of $/, which in this context is the current parse tree node. Results stored by make are automatically thrown away if backtracking subsequently throws away the enclosing parse node. Often that's precisely what one wants.)

  • foo => bar forms a Pair.

  • The postcircumfix [...] operator indexes its invocant:

    • In this case there's just a prefix . without an explicit LHS so the invocant is "it". The "it" was set up by the given, i.e. it's (excuse the pun) $<host><parts>.
  • The * in the index *-n is the invocant's length; so [ 0 .. *-3 ] is all but the last two elements of $<host><parts>.

  • The .say for ... line ends in .made3, to pick up the make'd value.

  • The make'd value is a list of three pairs breaking out $<host><parts>.
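A quick standalone illustration of that *-n indexing (a hypothetical array, nothing to do with the grammar itself):

```raku
my @parts = <baz buz example com>;
say @parts[0 .. *-3];  # (baz buz) — all but the last two elements
say @parts[*-2];       # example
say @parts[*-1];       # com
```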


Footnotes

1 I had truly thought my first two options were the two main ones available. It's been around 30 years since I encountered Tim Toady online. You'd think by now I'd have learned by heart his eponymous aphorism -- There Is More Than One Way To Do It!

2 Beware "pathological backtracking". In a production context, if you have suitable control of your input, or the system your program runs on, you may not have to worry about deliberate or accidental DoS attacks because they either can't happen, or will uselessly take down a system that's rebootable in the event of being rendered unavailable. But if you do need to worry, i.e. the parsing is running on a box that needs to be protected from a DoS attack, then an assessment of the threat is prudent. (Read Details of the Cloudflare outage on July 2, 2019 to get a real sense of what can go wrong.) If you are running Raku parsing code in such a demanding production environment then you would want to start an audit of code by searching for patterns that use regex, /.../ (the ... are metasyntax), :!r (to include :!ratchet), or *!.

3 There's an alias for .made; it's .ast. I think it stands for A Sparse Tree or Annotated Subset Tree and there's a cs.stackexchange.com question that agrees with me.

4 Golfing your problem, this seems wrong:

say 'a' ~~ rule  { .* a } # 「a」

More generally, I thought the only difference between a token and a rule was that the latter injects a <.ws> at each significant space. But that would mean this should work:

token TOP { <name> <.ws> '@' <.ws> [<subdomain> <.ws> '.']* <.ws>
            <domain> <.ws> '.' <.ws> <tld> <.ws>
} 

But it doesn't!

At first this freaked me out. Writing this footnote two months later, I'm feeling somewhat less freaked out.

Part of this is my speculation about the reason I've not been able to find anyone reporting this in the 15 years since the first Raku grammar prototype became available via Pugs. That speculation includes the possibility that @Larry deliberately designed them to work as they do, and it being a "bug" is primarily misunderstanding among the current crop of mere mortals like us trying to provide an explanation for why Raku does what it does based on our analysis of our sources -- roast, the original design documents, the compiler source code, etc.

In addition, given that the current "buggy" behaviour seems ideal and intuitive (except for contradicting the doc), I'm focusing on interpreting my feeling of great discomfort -- during this interim period of unknown length in which I don't understand why it gets it right -- as a positive experience. I hope others can too -- or, much better, figure out what is actually going on and let us know!

Shrum answered 27/5, 2020 at 22:6 Comment(8)
I'm not sure why say 'a' ~~ rule { .* a } # 「a」 seems problematic to you. Certainly managing wild-characters at the beginning of a string is a hassle, and rule seems to help that. Compare to use of the token keyword: say 'a' ~~ token { .* a } # Nil returns Nil, and (I imagine) could prove a frustrating experience to get working using the token keyword alone.Tyus
@Tyus say 'a' ~~ token { .* a } # Nil. That makes sense, because token does not backtrack. The .* swallows the a, does not backtrack, so there's no a left to match, so the match fails. But rule doesn't backtrack either. So why does it match? Does rule matching default to frugal, contrary to "By default, quantifiers request a greedy match"?Shrum
@Tyus Thanks for commenting. I've revisited the whole scene and have suddenly solved the "crime" (it's not a bug, just a misunderstanding). I've decided I will write another answer explaining it all (hopefully in the next few weeks, maybe a month or two); I will thank you in it for triggering my insight. :)Shrum
TY for the replies. I've been working on trying to sharpen/simplify whitespace-containing captures in the SO post that follows. Trying to implement Rules-vs-Tokens but am hitting an intellectual impasse. Thus I await your updated reply! https://mcmap.net/q/740884/-the-token-of-raku-grammar-doesn-39-t-hit-the-first-occurences-of-a-document-but-hits-the-similar-following-occurencesTyus
"But rule doesn't backtrack either. So why does it match?" My guess is for rules the Regex Engine "atomizes" desired-elements to include (and search for) significant whitespace. I think it may be instructive to look at the following four statements, because (significant) whitespace is what the rule keyword seems to enforce: say 'a' ~~ rule { .? a }; # 「a」 ; say 'a' ~~ rule { .?a }; # Nil; say 'aa' ~~ rule { .? a }; # Nil; say 'aa' ~~ rule { .?a }; # 「aa」Tyus
Right. Building on that, perhaps even more instructive is code like: say 'a' ~~ rule { ^ .? {say $/}a $ } etc (click to run what I mean in glot.io).Shrum
Well, I think I understand on an intuitive level what's going on, but I'm still perplexed as to the return values of some of your internal {$/} blocks. The quote from Synopsis_5 is thus "Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations." So maybe that's the answer in a nutshell. design.raku.org/S05.htmlTyus
say 'a' ~~ rule { .? a }; # 「a」 -- In a "scanning initialization" model (ratcheting, BTW), it's conceivable that an initial .?<.ws> atom gets filled with the a character, but the a immediately gets dropped on encountering the second . atom. The second . atom then picks up (gets filled with) the a and the entirety of the first .?<.ws> atom is deemed (ignorable) "initial whitespace" and thrown away.Tyus

Edit: this is probably a bug, so the straight answer to the question is whitespace interpretation (in some restricted ways), although the answer in this case seems to be "ratcheting". It shouldn't be, however, and it only happens sometimes, which is why the bug report has been created. Thanks a lot for the question. Anyway, find below a different (and presumably not buggy) way to solve the grammar problem.


It's probably good to use Grammar::Tracer to check what's going on; just download it and put use Grammar::Tracer at the top. In the first case: [Grammar::Tracer screenshot for the token version]

Tokens don't backtrack, so the [<subdomain> '.']* group gobbles up 'baz.' and 'example.' and won't give them back when the rest of the pattern fails. Let's see what's going on with a rule

[Grammar::Tracer screenshot for the rule version]

It does backtrack in this case. Which is surprising, since, well, it should not, according to the definition (and whitespace should be significant).

What can you do? It's probably better to take backtracking into account when dividing up the host.

use Grammar::Tracer;

grammar Email {
  token TOP { <name> '@' <host> }
  token name { \w+ ['.' \w+]* }
  token host { [\w+] ** 2..* % '.' }
}
say Email.parse('foo.bar@baz.example.com');

Here we make sure that we have at least two fragments, divided by a period.
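The ** 2..* % '.' combination can be tried on its own outside the grammar; a standalone sketch:

```raku
say so 'example.com' ~~ / ^ [\w+] ** 2..* % '.' $ /;  # True
say so 'example'     ~~ / ^ [\w+] ** 2..* % '.' $ /;  # False — needs at least two parts
say so 'a.b.c.d'     ~~ / ^ [\w+] ** 2..* % '.' $ /;  # True
```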

And then you use actions to divide the host into its different parts:

grammar Email {
  token TOP { <name> '@' <host> }
  token name { \w+ ['.' \w+]* }
  token host { [\w+] ** 2..* % '.' }
}

class Email-Action {
  method TOP ($/) {
    my %email;
    %email<name> = $/<name>.made;
    my @fragments = $/<host>.made.split("\.");
    %email<tld> = @fragments.pop;
    %email<domain> = @fragments.pop;
    %email<subdomain> = @fragments.join(".") if @fragments;
    make %email;
  }
  method name ($/) { make $/ }
  method host ($/) { make $/ }
}
say Email.parse('foo.bar@baz.example.com', actions => Email-Action.new).made;

We pop twice since we know that, at the very least, we have a TLD and a domain; if there's anything left, it goes to subdomains. For these inputs,

say Email.parse('foo.bar@baz.example.com', actions => Email-Action.new).made;
say Email.parse('foo@example.com', actions => Email-Action.new).made;
say Email.parse('foo.bar.baz@quux.zuuz.example.com', actions => Email-Action.new).made;

the correct answers are:

{domain => example, name => 「foo.bar」, subdomain => baz, tld => com}
{domain => example, name => 「foo」, tld => com}
{domain => example, name => 「foo.bar.baz」, subdomain => quux.zuuz, tld => com}

Grammars are incredibly powerful, but also, with their depth-first search, somewhat difficult to debug and wrap your head around. But if there's a part that can be deferred to actions, which, besides, gives you a ready-made data structure, why not use it?

I'm aware that this does not really answer your question of why a token behaves differently from a rule, with the rule behaving as if it were a regex, neither treating whitespace as significant nor ratcheting. I just don't know. The problem is that, in the way you have formulated your grammar, once it's gobbled up the period, it's not going to give it back. So either you somehow include the subdomain and domain in a single token so that it matches, or you will need a non-ratcheting environment like regexes (and, well, apparently rules too) to make it work. Take into account that tokens and regexes are very different things. They use the same notation and everything, but their behavior is totally different. I encourage you to use Grammar::Tracer or the grammar testing environment in CommaIDE to check the differences.

Character answered 28/5, 2020 at 7:27 Comment(2)
Tokens don't backtrack... which means that they ratchet instead. In your colorized graphic you purportedly show a Rule that does backtrack. Now what I'm trying to hunt down is a reference in the Perl6 foundational documents that says Rules can't.Tyus
Found it! "In particular, there are two special variants for use in grammars: token and rule. ...the rule declarator[-,] [+is used] for declaring non-terminal productions in a grammar. Like a token, it also does not backtrack by default." design.raku.org/S05.htmlTyus

As per the Raku docs:

  • Token methods are faster than regex methods and ignore whitespace. Token methods don't backtrack; they give up after the first possible match.
  • Rule methods are the same as token methods except whitespace is not ignored.

Not ignored means they are treated as syntax, rather than matched literally. They actually insert a <.ws>. See sigspace for more information about that.
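To see that behaviour concretely, here's a small sketch (building on examples raiph gives in the comments above). The inserted <.ws> (defined as token { <!ww> \s* }) requires either actual whitespace or a position that isn't between two word characters:

```raku
say so 'NOT NULL' ~~ rule { NOT NULL };     # True
say so 'NOTNULL'  ~~ rule { NOT NULL };     # False — <.ws> can't match between two \w chars
say so 'foo@bar'  ~~ rule { foo '@' bar };  # True — <.ws> matches zero-width at the '@'
```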

Portative answered 27/5, 2020 at 22:3 Comment(11)
Yeah, I know that; my original question also reflects the fact that I understand it doesn't ignore whitespace. But emails don't have whitespace, this shouldn't work!Pierpont
"emails don't have whitespace, this shouldn't work!" You're right it shouldn't work, but it's not because emails don't have whitespace. Raku is carefully designed with the intent of making things that are complicated in other languages become simple. One way to do that is pedagogical facilitation. ws is an examplar. The phrase "significant whitespace" is technically only about whitespace in the pattern, not input. Yes, the default ws can match whitespace. But it turns out that <ws> can also match where there is no whitespace!Shrum
@Shrum but if it matches where there's no whitespace, how can you then rely on it? If I want a rule constraint { NOT NULL | PRIMARY KEY } I don't want NOTNULL to be a valid matchPierpont
@ElectricCoffee ws's definition fits well with an average person's natural language intuition. So it's easy to rely on it if your intuition is an average person's intuition, and you're willing to go with your intuitive sense. Or if you're willing to look at the doc definition -- token { <!ww> \s* } -- and rely on that. rule constraint { NOT NULL | PRIMARY KEY } shouldn't match NOTNULL per intuition, and won't per the default ws rule. But rule { foo '@' bar } will match both foo @ bar and foo@bar. Should ws have been named otherwise? Pedagogical facilitators thought not. :)Shrum
@Shrum "rule constraint { NOT NULL | PRIMARY KEY } shouldn't match NOTNULL per intuition, and won't per the default ws rule. But rule { foo '@' bar } will match both foo @ bar and foo@bar" It's not at all obvious how the language would even distinguish between those two scenarios...Pierpont
@ElectricCoffee It's not at all obvious how. But that's Larry's forte. It accords with an average person's natural language intuition. In NOT NULL there's a "token" (not token!) break between the NOT and the NULL. In foo@bar there are likewise "token" breaks between foo and @ and bar. Or, if you're willing to look at the doc definition, it's <!ww> \s*, where ww means "between \w characters", and !ww means not between them, and \w goes per Unicode's "word character" property, and \s means whitespace characters, again per Unicode, and \s* means zero or more.Shrum
@Shrum the average person's intuition is full of inconsistencies and is generally unreliable thoughPierpont
@ElectricCoffee I think part of the issue here is a sly trick Larry has pulled with ws. What does it stand for? I used to think "whitespace". And I think he partly intends to light up those neurons. But I now realize another fit is "word separator", where "word" is a string of characters that match Unicode's definition of a "word character", and ws is consistent with the parsing concept of a token separator. So a token pattern is about matching a single token whereas a rule pattern is about matching a string of separated "tokens" aka "words". foo@bar is 2 "words" / 3 "tokens".Shrum
@ElectricCoffee Bingo! Now I've figured out the mystery. I'll be completely rewriting my answer, adding text to the bug report to explain it's not a bug, recommending doc changes, etc. It might take me a while (weeks) to do it, but it sure feels wonderful to have solved the riddle, with such a simple outcome, which I'll be delighted to share when I get to it. Your question mortified me, because I thought it might be a behavioural bug rather than just misunderstanding on our part, and it was sooooo basic! I'm so glad you asked it, even if it almost caused me nightmares at the time. :)Shrum
@Shrum I'm just doing my part trying to make things better :)Pierpont
@ElectricCoffee I think you played your role well! I plan to play mine again, and hope you'll continue to play yours as my new answer unfolds. (I've decided to write a separate new answer, rather than edit my current one. I expect to have at least two parts, with the first directly addressing your original example; my hope is you will apply the same careful thinking you've shown heretofore, and dialog with me about the first part, in comments under the new answer, until you fully and unreservedly agree with it, and signal that by at least considering switching your acceptance marker to it.)Shrum
