Perl 6 Grammar doesn't match like I think it should

I'm doing Advent of Code day 9:

You sit for a while and record part of the stream (your puzzle input). The characters represent groups - sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group - that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.

Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.

In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.

Of course, this screams out for a Perl 6 Grammar...

grammar Stream
{
    rule TOP { ^ <group> $ }

    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar { <-[ !> ]> }
}

This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:

say Stream.parse('{<aa>}');

gives Nil.

Grammar::Tracer is no help:

TOP
|  group
|  |  group
|  |  * FAIL
|  |  garbage
|  |  |  garbchar
|  |  |  * MATCH "a"
|  |  * FAIL
|  * FAIL
* FAIL
Nil

Multiple garbignores are no problem:

say Stream.parse('{<!!a!a>}');

gives:

「{<!!a!a>}」
 group => 「{<!!a!a>}」
  garbage => 「<!!a!a>」
   garbignore => 「!!」
   garbchar => 「a」
   garbignore => 「!a」

Any ideas?

Domitiladomonic answered 9/12, 2017 at 12:11 Comment(3)
I would use token group { '{' ~ '}' [ <group> || <garbage> ]* % ',' } as it puts the { and } together. – Volitant
@Brad: Could do that, well, without the final }, and there are some benefits to that, but I personally think that's less readable, since you can no longer simply read from left to right. – Domitiladomonic
@Domitiladomonic The rationale for twiddles is that writing it left-to-right tends to produce bizarre error messages if parsing fails whereas a twiddle tends to produce nice error messages. In addition, while I considered the twiddle construct less readable when I first saw it, familiarity very quickly led me to find it more readable. – Sofa

Update: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice: use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).
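
As a minimal sketch of that advice, here is the grammar from the question with every rule changed to token and nothing else altered. The two inputs are the ones from the question; the alignment of the token bodies is only cosmetic.

```raku
grammar Stream {
    # Same patterns as in the question, but declared as tokens,
    # so whitespace inside the patterns is not significant.
    token TOP        { ^ <group> $ }

    token group      { '{' [ <group> || <garbage> ]* % ',' '}' }
    token garbage    { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

say so Stream.parse('{<aa>}');      # True
say so Stream.parse('{<!!a!a>}');   # True
```

With tokens there is no implicit whitespace matching at all, which fits a puzzle input that contains no meaningful whitespace.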


My original answer below explored why the rules didn't work. I'll leave it in for now.


TL;DR <garbchar> | contains a space. In a rule, whitespace that directly follows an atom is significant: it stands for an implicit call to <.ws>, i.e. a tokenizing break. You can simply remove this inappropriate space, i.e. write <garbchar>| instead (or better still, <.garbchar>| if you don't need to capture the garbage) to get the result you seek for '{<aa>}'. Note that the space after <garbignore> is significant in the same way.
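
One way to see the effect of that space is to write the whitespace call by hand. The sketch below uses two made-up grammars, WithWs and WithoutWs, that differ only in an explicit <.ws> after each alternative. It is meant as an illustration of the idea, not as the exact expansion that the compiler produces for the rule in the question.

```raku
# Hypothetical names; both grammars parse a single piece of garbage.
grammar WithWs {
    # Explicit <.ws> after each alternative, imitating the significant spaces.
    token TOP        { '<' [ <garbchar> <.ws> | <garbignore> <.ws> ]* '>' }
    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

grammar WithoutWs {
    # No whitespace matching between garbage characters.
    token TOP        { '<' [ <garbchar> | <garbignore> ]* '>' }
    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

say so WithWs.parse('<aa>');       # False
say so WithWs.parse('<!!a!a>');    # True
say so WithoutWs.parse('<aa>');    # True
```

The first call fails because <.ws> is asked to match between the two a's, which is the same place the rule in the question gives up.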


As your original question allowed, this isn't a bug, it's just that your mental model is off.

Your answer correctly identifies the issue: tokenization.

So what we're left with is your follow-up question, which is about your mental model of tokenization, or at least how Perl 6 tokenizes by default:

why ... my second example ... goes wrong with two garbchars in a row:

'{<aa>}'

Simplifying, the issue is how to tokenize this:

aa

The simple high level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and, by default, Perl 6 assumes this ordinary definition. This is the issue you're encountering.

You can overrule this ordinary definition to get any tokenizing result you care to achieve. But it's seldom necessary to do so and it certainly isn't in simple cases like this.

I'll provide two redundant paths that I hope might lead folk to the correct mental model:

Excerpting from the "Obstacles" section of the Wikipedia page on tokenization, and interleaving the excerpts with Perl 6-specific discussion:

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:

  • Punctuation and whitespace may or may not be included in the resulting list of tokens.

In Perl 6 you control what gets included or not in the parse tree using capturing features that are orthogonal to tokenizing.

  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.

  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

By default, the Perl 6 design embodies an equivalent of these two heuristics.

The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.
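
These default heuristics can be checked directly by calling <.ws> in an ordinary regex. A small sketch, using throwaway strings chosen only for illustration:

```raku
# <.ws> refuses to match in the middle of a word...
say so 'aa'  ~~ / a <.ws> a /;     # False

# ...but matches real whitespace between two word characters...
say so 'a a' ~~ / a <.ws> a /;     # True

# ...and matches zero characters between a word character and punctuation.
say so 'a!'  ~~ / a <.ws> '!' /;   # True
```

So a run of alphabetic characters is treated as one token, while whitespace or punctuation marks a boundary where <.ws> is allowed to succeed.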

I think I'll end my answer here because it's already getting pretty long. Please use the comments to help us improve this answer. I hope what I've written so far helps.

Sofa answered 9/12, 2017 at 19:22 Comment(2)
Thanks for your extensive answer. I have a much better understanding of tokens and rules now. – Domitiladomonic
You're welcome. I'll leave my answer as is then. Thanks for the feedback. :) – Sofa

A partial answer to my own question: Change all the rules to tokens and it works. It makes sense, because the difference is :sigspace, which we don't need or want here. What I don't understand, though, is why it did work for some input, like my second example.

The resulting code is here, if you're interested.

Domitiladomonic answered 9/12, 2017 at 12:36 Comment(3)
I would argue that you should default to tokens, and only use rules where you are dealing with whitespace a lot. – Volitant
The default ws implementation is token ws { <!ww> \s* }. The <!ww> means "not within a word", so it fails to match between two a's, but successfully matches zero characters between word and non-word characters. – Fart
Thanks, @moritz, that's the simple explanation of why my code sometimes works and sometimes doesn't, which I was looking for. – Domitiladomonic
