Raku: effect of capture markers is lost "higher up"

Asked 15/8, 2020 at 13:4 Answered 15/8, 2020 at 20:13

The following Raku script:

#!/usr/bin/env raku
use v6.d;

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    token value { <strvalue> | <numvalue> }
    token strvalue { '"' <( <-["]>* )> '"' }
    token numvalue { '-'? \d+ [ '.' \d* ]? }
}

say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');

has the following output:

｢foo = 42｣
 keyword => ｢foo｣
 value => ｢42｣
  numvalue => ｢42｣
｢bar = "Hello, World!"｣
 keyword => ｢bar｣
 value => ｢"Hello, World!"｣
  strvalue => ｢Hello, World!｣

For the second item, note that strvalue contains the string value without quotes, as intended with the capture markers <( ... )>. However, to my surprise, the quotes are included in value.

Is there a way around this?

Cluster answered 15/8, 2020 at 13:4 Comment(2)

hi @Cluster - please can you say a bit more about why this behaviour is problematic - it seems reasonable to me ... on the principle that the .raku method gives you something that will reparse to the original, you have 42 => Num and “xyz” => Str – Discredit 15/8, 2020 at 20:26

@p6steve, I don't know if it's problematic, but it's not what I expected and needed. And I don't think it is intuitive: I want a value, it is a string value or a numeric value. If it's a string value, I want the value without the surrounding quotes. I ask for that, and it puts the quotes back in the value. – Cluster 16/8, 2020 at 15:32

TL;DR Use "multiple dispatch".^[1,2] See @user0721090601's answer for a thorough explanation of why things are as they are. See @p6steve's for a really smart change to your grammar if you want your number syntax to match Raku's.

A multiple dispatch solution

Is there a way around this?

One way is to switch to explicit multiple dispatch.

You currently have a value token which calls specifically named value variants:

    token value { <strvalue> | <numvalue> }

Replace that with:

    proto token value {*}

and then rename the called tokens according to grammar multiple dispatch targeting rules, so the grammar becomes:

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value {*}
    token value:str { '"' <( <-["]>* )> '"' }
    token value:num { '-'? \d+ [ '.' \d* ]? }
}

say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');

This displays:

｢foo = 42｣
 keyword => ｢foo｣
 value => ｢42｣
｢bar = "Hello, World!"｣
 keyword => ｢bar｣
 value => ｢Hello, World!｣

This doesn't capture the individual alternations by default. We can stick with "multiple dispatch" but reintroduce naming of the sub-captures:

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value { * }
    token value:str { '"' <( $<strvalue>=(<-["]>*) )> '"' }
    token value:num { $<numvalue>=('-'? \d+ [ '.' \d* ]?) }
}

say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');

displays:

｢foo = 42｣
 keyword => ｢foo｣
 value => ｢42｣
  numvalue => ｢42｣
｢bar = "Hello, World!"｣
 keyword => ｢bar｣
 value => ｢Hello, World!｣
  strvalue => ｢Hello, World!｣

Surprises

to my surprise, the quotes are included in value.

I too was initially surprised.^[3]

But the current behaviour also makes sense to me in at least the following senses:

The existing behaviour has merit in some circumstances;
It wouldn't be surprising if I was expecting it, which I think I might well have done in some other circumstances;
It's not easy to see how one would get the current behaviour if it was wanted but instead worked as you (and I) initially expected;
There's a solution, as covered above.

Footnotes

^[1] Use of multiple dispatch^[2] is a solution, but seems overly complex imo given the original problem. Perhaps there's a simpler solution. Perhaps someone will provide it in another answer to your question. If not, I would hope that we one day have at least one much simpler solution. However, I wouldn't be surprised if we don't get one for many years. We have the above solution, and there's plenty else to do.

^[2] While you can declare, say, method value:foo { ... } and write a method (provided each such method returns a match object), I don't think Rakudo uses the usual multiple method dispatch mechanism to dispatch to non-method rule alternations but instead uses an NFA.

^[3] Some might argue that it "should", "could", or "would" "be for the best" if Raku did as we expected. I find I think my best thoughts if I generally avoid [sh|c|w]oulding about bugs/features unless I'm willing to take any and all downsides that others raise into consideration and am willing to help do the work needed to get things done. So I'll just say that I'm currently seeing it as 10% bug, 90% feature, but "could" swing to 100% bug or 100% feature depending on whether I'd want that behaviour or not in a given scenario, and depending on what others think.

Diary answered 15/8, 2020 at 15:30 Comment(7)

++ This is actually why I'm including a special is scouring trait in my binary regex work. Tokens can opt-in to having their match results modified (similar to <( )> but with more modifications possible). Like with stringy regex, the modifications don't get passed up… unless the enclosing token also opts in, and then the modified results are pervasive. – Signorelli 15/8, 2020 at 15:51

The catch is it requires a fair amount of data caching since you now have to store the modified set of data, and that's certainly why capture markers don't stick around: every match is at its core just two ints and a reference to the original string. If capture markers were pervasive, you'd need to start storing more data (a list of pairs of ints), and now concatenation is nowhere near as performant. – Signorelli 15/8, 2020 at 15:53

@Signorelli Great comments. Hopefully you'll come up with an abstract design that you're reasonably confident will admit concrete implementations for both binary and text regexes that are syntactically consistent with each other and as performant as they can be for their respective cases. Thanks for writing comments that pack in lots of extra value for this answer. I wrote a paragraph about the plausible underlying performance reasons in my answer prior to publishing it but ended up stripping it just before posting because it was too vague. Your comments are much better than what I had. :) – Diary 15/8, 2020 at 16:57

Thanks for your extensive answer. I think I won't need strvalue or numvalue, so multiple dispatch is the way to go for me. – Cluster 15/8, 2020 at 19:56

raiph: if you haven't seen the proposal, take a look at it and please leave any comments. gist.github.com/alabamenhu/2fec7a8f51a24091dc1b104a2ae2f04d If you look on my repositories on github, you can see a test implementation for the binary version. I'm taking a break until RakuAST is finished and focusing on international components, but once RAST is out, I'm going to pick it back up. – Signorelli 16/8, 2020 at 4:21

There is another way. Just change token value to be method value instead. grammar MyGrammar { …; method value ( |C ) { self.strvalue( |C ) || self.numvalue( |C ) }} – Gullett 19/8, 2020 at 2:14

@BradGilbert Oh, that's really nice. I'll add this to my answer. – Diary 19/8, 2020 at 10:36

The <( and )> capture markers only work within a given token. Basically, each token returns a Match object that says "I matched the original string from index X (.from) to index Y (.to)", which is taken into account when stringifying Match objects. That's what's happening with your strvalue token:

my $text = 'bar = "Hello, World!"';
my $m = MyGrammar.parse: $text;

my $start = $m<value><strvalue>.from;     # 7
my $end   = $m<value><strvalue>.to;       # 20
say $text.substr: $start, $end - $start;  # Hello, World!

You'll notice that there are only two numbers: a start and a finish value. This means that when you look at the value token you have, it can't create a discontiguous match. So its .from is set to 6, and its .to to 21.
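To see the two spans side by side, here is a small sketch that reuses the grammar from the question unchanged and prints .from and .to for both the inner strvalue capture and the enclosing value capture:

```raku
grammar MyGrammar {
    rule  TOP      { <keyword> '=' <value> }
    token keyword  { \w+ }
    token value    { <strvalue> | <numvalue> }
    token strvalue { '"' <( <-["]>* )> '"' }
    token numvalue { '-'? \d+ [ '.' \d* ]? }
}

my $text = 'bar = "Hello, World!"';
my $m    = MyGrammar.parse: $text;

# Inner token: the capture markers narrow the reported span.
say $m<value><strvalue>.from, ' .. ', $m<value><strvalue>.to;   # 7 .. 20

# Outer token: it only records where it started and stopped matching,
# and that span includes both quote characters.
say $m<value>.from, ' .. ', $m<value>.to;                       # 6 .. 21
```

The outer span is a single contiguous range, so there is no way for it to "skip" the quotes that the inner token excluded.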

There are two ways around this: by using (a) an actions object or (b) a multitoken. Both have their advantages, and depending on how you want to use this in a larger project, you might want to opt for one or the other.

While you can technically define actions directly within a grammar, it's much easier to do them via a separate class. So we might have for you:

class MyActions { 
  method TOP      ($/) { make $<keyword>.made => $<value>.made }
  method keyword  ($/) { make ~$/ }
  method value    ($/) { make ($<numvalue> // $<strvalue>).made }
  method numvalue ($/) { make +$/ }
  method strvalue ($/) { make ~$/ }
}

Each level uses make to pass values up to whatever token includes it, and the enclosing token has access to those values via the .made method. This is really nice when, instead of working with pure string values, you want to process them first in some way and create an object or similar.

To parse, you just do:

my $m = MyGrammar.parse: $text, :actions(MyActions);
say $m.made; # bar => Hello, World!

Which is actually a Pair object. You could change the exact result by modifying the TOP method.

The second way you can work around things is to use a multi token. It's fairly common in developing grammars to use something akin to

token foo { <option-A> | <option-B> }

But as you can see from the actions class, it requires us to check and see which one was actually matched. Instead, if the alternation can acceptably be done with |, you can use a multitoken:

proto token foo { * }
multi token foo:sym<A> { ... }
multi token foo:sym<B> { ... }

When you use <foo> in your grammar, it will match either of the two multi versions as if it had been in the baseline <foo>. Even better, if you're using an actions class, you can similarly just use $<foo> and know it's there without any conditionals or other checks.

In your case, it would look like this:

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value { * }
    multi token value:sym<str> { '"' <( <-["]>* )> '"' }
    multi token value:sym<num> { '-'? \d+ [ '.' \d* ]? }
}

Now we can access things as you were originally expecting, without using an actions object:

my $text = 'bar = "Hello, World!"';
my $m = MyGrammar.parse: $text;

say $m;        # ｢bar = "Hello, World!"｣
               #  keyword => ｢bar｣
               #  value => ｢Hello, World!｣

say $m<value>; # ｢Hello, World!｣

For reference, you can combine both techniques. Here's how I would now write the actions object given the multi token:

class MyActions { 
  method TOP            ($/) { make $<keyword>.made => $<value>.made }
  method keyword        ($/) { make ~$/ }
  method value:sym<str> ($/) { make ~$/ }
  method value:sym<num> ($/) { make +$/ }
}

Which is a bit more grokkable at first look.
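As a usage sketch, here are the multi token grammar and this actions class put together and run against both inputs from the question. Nothing new is introduced; the loop is only there to show the two results next to each other:

```raku
grammar MyGrammar {
    rule  TOP     { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value { * }
    multi token value:sym<str> { '"' <( <-["]>* )> '"' }
    multi token value:sym<num> { '-'? \d+ [ '.' \d* ]? }
}

class MyActions {
    method TOP            ($/) { make $<keyword>.made => $<value>.made }
    method keyword        ($/) { make ~$/ }
    method value:sym<str> ($/) { make ~$/ }   # string without the quotes
    method value:sym<num> ($/) { make +$/ }   # coerced to a number
}

for 'foo = 42', 'bar = "Hello, World!"' -> $text {
    my $m = MyGrammar.parse: $text, :actions(MyActions);
    say $m.made;
}
# foo => 42
# bar => Hello, World!
```

Note that the numeric branch hands back an actual number rather than a string, so code that consumes the Pair doesn't have to coerce it again.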

Signorelli answered 15/8, 2020 at 15:46 Comment(0)

Rather than rolling your own token value:str & token value:num you may want to use Regex Boolean check for Num (+) and Str (~) matching - as explained to me here and documented here

token number { \S+ <?{ defined +"$/" }> }
token string { \S+ <?{ defined ~"$/" }> }
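A minimal sketch of the numeric check in use, assuming a throwaway grammar (the name NumberCheck is hypothetical and only wraps the number token shown above). The idea is that the code assertion fails whenever the text matched so far can't be coerced to a number, so inputs such as 'abc' should be rejected while anything Raku itself would read as a number should be accepted:

```raku
grammar NumberCheck {
    token TOP    { <number> }
    # \S+ grabs the candidate text; the assertion then asks Raku
    # whether that text coerces to a defined numeric value.
    token number { \S+ <?{ defined +"$/" }> }
}

for '42', '-3.14', '1e3', 'abc' -> $input {
    my $ok = so NumberCheck.parse($input);
    say $input, ' => ', ($ok ?? 'accepted' !! 'rejected');
}
```

As noted above, this does not strip surrounding quotes from strings; it only shows how to lean on Raku's own number syntax instead of hand-writing a pattern.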

Discredit answered 15/8, 2020 at 20:13 Comment(3)

This is a great idea. Unfortunately the syntax as is won't quite work. You need the { … } to enclose everything, but the $/ includes everything matched up to that point: which is currently nothing. That's why @mortiz 's answer had the « \S+ » included to capture some text to test the coercion against. – Signorelli 16/8, 2020 at 4:31

thanks @Signorelli - I have edited my answer accordingly, and would also note that this example does not directly address the intent of the OP as it will include the surrounding "" - my point here is to mention how one can use the built in Raku syntax to detect if the captured text is a Num or Str rather than roll your own ... that way stuff like thousand markers, decimal points, etc. just work and this idea can be generalised to other types like Rat and Complex... – Discredit 16/8, 2020 at 21:42

@p6steve I've now pointed to this answer in mine. The numeric assertion is a nice tip. It's perfect if someone wants to accept any number raku would accept. And the general technique can be applied to many things. So readers ought to experience an "Oh! Nice!" moment. In contrast the string assertion is unhelpful imo. Any stringification of a match object is going to successfully coerce to a defined string, even if it's blank. So the assertion is redundant, which is probably going to be confusing. I get the desire for "consistency", but imo "A foolish consistency" in an example detracts. – Diary 17/8, 2020 at 22:17
