Grammar a bit too greedy in Perl6

I am having problems with this mini-grammar, which tries to match markdown-like header constructs.

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
}

I would like it to match ## Easier ## as a header, but instead it takes ## as part of span:

TOP
|  header
|  |  hashes
|  |  * MATCH "##"
|  |  span
|  |  |  like-a-word
|  |  |  * MATCH "Easier"
|  |  |  like-a-word
|  |  |  * MATCH "##"
|  |  |  like-a-word
|  |  |  * FAIL
|  |  * MATCH "Easier ##"
|  * MATCH "## Easier ##"
* MATCH "## Easier ##\n"
「## Easier ##
」
 header => 「## Easier ##」
  hashes => 「##」
  span => 「Easier ##」
   like-a-word => 「Easier」
   like-a-word => 「##」

The problem is that the [\h* $0]? part does not seem to work: span gobbles up all available words, including the closing ##. Any ideas?

Conflagration answered 5/1, 2018 at 9:1 Comment(7)
Try ? after *. (Deaden)
If you mean in the Span definition, I did. Same result. (Conflagration)
Probably I don't understand something, but what would you expect $0 to match when there are no positional captures? (Fermi)
That is absolutely right. It was a leftover from old attempts. (Conflagration)
FYI: You should always use token unless you know you need one of the other options (regex, rule, and method). (Youthful)
Only use regex if you are sure you need backtracking, because unnecessary backtracking will frequently make parsing run literally millions of times slower, or worse. If you switch all your regex declarations to token you'll see that your code will continue to parse correctly (at least for your trial input "## Easier ##\n") but will quite plausibly run vastly faster on large or complex inputs. (Youthful)
I think I need backtracking here. "## Easy Peasy ##" will fail, for instance. I can change the lowest level, like-a-word, to a token, though. (Conflagration)

First, as others have pointed out, <hashes> does not capture into $0; it captures into $<hashes>. So you have to write:

regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}

But that still doesn't match the way you want, because the [\h* $<hashes>]? part happily matches zero occurrences.
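
To make that concrete, here is a minimal sketch of the same situation. The grammar name Demo::Greedy, the token word, and the alias $<close> are made up for illustration and are not part of the original code; the alias only exists so the optional group can be inspected afterwards.

```raku
# Hypothetical reduction of the grammar above; Demo::Greedy, word and
# $<close> are illustrative names, not part of the original code.
grammar Demo::Greedy {
    token hashes { '#' ** 1..6 }
    token word   { \S+ }
    regex span   { <word> [\s+ <word>]* }
    # $<close> aliases the optional closing run of hashes so it can be inspected
    regex header { ^^ <hashes> \h+ <span> $<close>=[\h* $<hashes>]? $$ }
}

my $m = Demo::Greedy.subparse("## Easier ##", :rule<header>);
say ~$m<span>;      # Easier ##
say so $m<close>;   # False: the optional group matched zero times
```

Because span is greedy and the overall match already succeeds with the closer left empty, nothing forces the regex to backtrack and hand the trailing ## to the optional group.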

The proper fix is to not let span match ## as a word:

role Like-a-word {
    regex like-a-word { <!before '#'> \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
}

say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);

If you are loath to modify like-a-word, you can also force the exclusion of a final # from it like this:

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> <!after '#'> [\h* $<hashes>]? $$}
}

say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);
Brom answered 6/1, 2018 at 0:11 Comment(3)
It's fine, except that I might want to capture # in "like-a-word". Markdown can't fail, so if I have something like ### not a header ## I would want ## to be interpreted as like-a-word. So I guess the first one is fine, but leaving like-a-word just the way it was. Thanks a lot! (Conflagration)
@Conflagration Food for thought / tests: what's supposed to happen with each of the following? (Youthful)
    ## two hashes on left, three on right ###
    ## two hashes on left, two plus two on right ## ##
    ## two hashes on left, two plus three on right ## ###
    ## two hashes on left, three plus two on right ### ##
    ## ## two plus two hashes on left, two plus two on right ## ##
    ## two hashes on left, two plus two on right ## ##
    ## two hashes on left, two in middle ## and some more text ##
@Conflagration you can always try to do the stricter parse first, and then use || to fall back to something that always matches. (Brom)
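
The "stricter parse first, then fall back" idea from the last comment could look roughly like the sketch below. This is only one possible reading of that suggestion: the grammar name Demo::Fallback and the token word are hypothetical, and it assumes a closing run of hashes only counts when it repeats the opening run exactly.

```raku
# Hypothetical sketch; Demo::Fallback and word are illustrative names.
grammar Demo::Fallback {
    token hashes { '#' ** 1..6 }
    token word   { \S+ }
    regex span   { <word> [\h+ <word>]* }
    regex header {
        ^^ <hashes> \h+
        [
        || <span> \h+ $<hashes> $$   # strict: closer repeats the opening hashes
        || <span> $$                 # fallback: everything left is the span
        ]
    }
}

my $closed = Demo::Fallback.subparse("## Easier ##", :rule<header>);
say ~$closed<span>;

my $open = Demo::Fallback.subparse("### not a header ##", :rule<header>);
say ~$open<span>;
```

With the strict branch, the regex has to backtrack into span until the closing hashes are left over; when no matching closer exists, the second branch keeps the trailing ## as an ordinary word, which is the behaviour asked for with ### not a header ##.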

Just change

  regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}

to

  regex header {^^ (<hashes>) \h+ <span> [\h* $0]? $$}

so that <hashes> is captured positionally and $0 has something to refer to. Thanks to Eugene Barsky for pointing this out.

Conflagration answered 5/1, 2018 at 9:59 Comment(0)

I played with this a bit because I thought there were two interesting things you might do.

First, you can make hashes take an argument about how many it will match. That way you can do special things based on the level if you like. You can reuse hashes in different parts of the grammar where you require different but exact numbers of hash marks.

Next, the ~ stitcher allows you to specify that something will show up in the middle of two things so you can put those wrapper things next to each other. For example, to match (Foo) you could write '(' ~ ')' Foo. With that it looks like I came up with the same thing you posted:

use Grammar::Tracer;

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* }
}

grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes ( $n = 1 ) { '#' ** {$n} }

    regex header { [(<hashes(2)>) \h*] ~ [\h* $0] <span>  }
}

my $result = Grammar::Headers.parse( "## Easier ##\n" );

say $result;
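
As a smaller illustration of the two ideas in isolation, here is a sketch with made-up names (Demo::Params, title, parens, h2, h3 are all hypothetical). It shows a token that takes the number of hashes as an argument, and the ~ construct wrapping a rule between two delimiters.

```raku
# Hypothetical sketch; every name in this grammar is illustrative.
grammar Demo::Params {
    # $n is the exact number of '#' characters to match (default 1)
    token hashes($n = 1) { '#' ** {$n} }
    token title  { \w+ }
    # '(' ~ ')' <title> matches "(", then <title>, then ")"
    token parens { '(' ~ ')' <title> }
    token h2     { <hashes(2)> \h+ <title> }
    token h3     { <hashes(3)> \h+ <title> }
}

my $p = Demo::Params.parse("(Foo)", :rule<parens>);
say ~$p<title>;                                       # Foo

say so Demo::Params.parse("## Two", :rule<h2>);       # True
say so Demo::Params.parse("### Three", :rule<h2>);    # False: h2 wants exactly two

my $h = Demo::Params.parse("### Three", :rule<h3>);
say ~$h<hashes>;                                      # ###
```

Note that a call such as <hashes(3)> is still captured under the key hashes in the Match object; the argument does not become part of the capture name.
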
Deformation answered 6/1, 2018 at 17:24 Comment(2)
Thanks for the answer. I wonder how hashes will show up in the Match object. Plus, will I also need to declare header in the same way, using $n as a parameter? (Conflagration)
I think you could declare header to take a parameter and then pass that on to something below it. However, I'd probably lean towards making header1, header2, and so on. That might make the AST easier when you want to play with it. But, I hadn't thought that long on it. :) (Deformation)
