Customising the parse return value, retaining unnamed terminals

Consider the grammar:

TOP ⩴ 'x' Y 'z'
Y ⩴ 'y'

Here's how to get the exact value ["TOP","x",["Y","y"],"z"] with various parsers (not written manually, but generated from the grammar):

xyz__Parse-Eyapp.eyp

%strict
%tree

%%
start:
    TOP { shift; use JSON::MaybeXS qw(encode_json); print encode_json $_[0] };
TOP:
    'x' Y 'z'   { shift; ['TOP', (scalar @_) ? @_ : undef] };
Y:
    'y' { shift; ['Y', (scalar @_) ? @_ : undef] };

%%

xyz__Regexp-Grammars.pl

use 5.028;
use strictures;
use Regexp::Grammars;
use JSON::MaybeXS qw(encode_json);
print encode_json $/{TOP} if (do { local $/; readline; }) =~ qr{
<nocontext:>
<TOP>
<rule: TOP>
    <[anon=(x)]> <[anon=Y]> <[anon=(z)]>
    <MATCH=(?{['TOP', $MATCH{anon} ? $MATCH{anon}->@* : undef]})>
<rule: Y>
    <[anon=(y)]>
    <MATCH=(?{['Y', $MATCH{anon} ? $MATCH{anon}->@* : undef]})>

}msx;

Code elided for the next two parsers. With Pegex, the functionality is achieved by inheriting from Pegex::Receiver. With Marpa-R2, the customisation of the return value is quite limited, but nested arrays are possible out of the box with a configuration option.

I have demonstrated that the desired customisation is possible, although it's not always easy or straight-forward. These pieces of code attached to the rules are run as the tree is assembled.


In a Raku (Perl 6) grammar, by contrast, the parse method returns nothing but nested Match objects, which are unwieldy. They do not retain the unnamed terminals! (To be clear about what I mean: these are the two pieces of data on the RHS of the TOP rule whose values are 'x' and 'z'.) Apparently only data that comes from named declarators is added to the tree.

Assigning to the match variable (analogous to how it works in Regexp-Grammars) seems to have no effect. Since the terminals do not make it into the match variable, actions don't help either.

In summary, here's the grammar and ordinary parse value:

grammar {rule TOP { x <Y> z }; rule Y { y };}.parse('x y z')

How do you get the value ["TOP","x",["Y","y"],"z"] from it? You are not allowed to change the shape of the rules, because that could spoil user-attached semantics; anything else is fair game. I still think the key to the solution is the match variable, but I can't see how.

Dresden answered 25/1, 2019 at 19:53

Not a full answer, but the Match.chunks method gives you a view of the input string tokenized into captured and non-captured parts.

It does not, however, let you distinguish between non-capturing literals in the regex and implicitly matched whitespace.

You could circumvent that by adding positional captures and using Match.caps:

my $m = grammar {rule TOP { (x) <Y> (z) }; rule Y { (y) }}.parse('x y z');

sub transform(Pair $p) {
    given $p.key {
        when Int { $p.value.Str }
        when Str { ($p.key, $p.value.caps.map(&transform)).flat }
    }
}

say $m.caps.map(&transform);

This produces

(x (Y y) z)

so pretty much what you wanted, except that the top-level TOP is missing (which you'll likely only get in there if you hard-code it).

Note that this doesn't cover all edge cases; for example when a capture is quantified, $p.value is an Array, not a match object, so you'll need another level of .map in there, but the general idea should be clear.
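As a rough sketch of the hard-coding mentioned above, the same transform can build nested arrays instead of flat lists and prepend the rule name 'TOP' by hand. The variable name @tree and the use of .raku to print the structure are illustrative choices, not part of the original answer, and the quantified-capture edge case is still not handled here.

```raku
# Same grammar as above, with positional captures around the literals.
my $m = grammar {
    rule TOP { (x) <Y> (z) }
    rule Y   { (y) }
}.parse('x y z');

# Positional captures (Int key) become plain strings;
# named captures (Str key) become [name, children...] arrays.
sub transform(Pair $p) {
    given $p.key {
        when Int { $p.value.Str }
        when Str { [$p.key, |$p.value.caps.map(&transform)] }
    }
}

# The top-level rule name is not part of .caps, so it is hard-coded here.
my @tree = 'TOP', |$m.caps.map(&transform);
say @tree.raku;
```

If JSON output is needed, @tree could then be handed to a JSON serialiser of your choice.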

Orrery answered 25/1, 2019 at 20:10