Unable to write a grammar in perl6 for parsing lines with special characters

I have the code in: https://gist.github.com/ravbell/d94b37f1a346a1f73b5a827d9eaf7c92

use v6;
#use Grammar::Tracer;


grammar invoice {

    token ws { \h*};
    token super-word {\S+};
    token super-phrase { <super-word> [\h  <super-word>]*}
    token line {^^ \h* [ <super-word> \h+]* <super-word>* \n};

    token invoice-prelude-start {^^'Invoice Summary'\n}
    token invoice-prelude-end {<line> <?before 'Start Invoice Details'\n>};

    rule invoice-prelude {
        <invoice-prelude-start>
        <line>*?
        <invoice-prelude-end>
        <line>
    }
}

multi sub MAIN(){ 

    my $t = q :to/EOQ/; 
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf $1234.00
    qwe {rq} [we-r_q] we
    Start Invoice Details 
    EOQ


    say $t;
    say invoice.parse($t,:rule<invoice-prelude>);
}

multi sub MAIN('test'){
    use Test;
    ok invoice.parse('Invoice Summary' ~ "\n", rule => <invoice-prelude-start>);

    ok invoice.parse('asdfa {sf} asd-[fasdf] #werwerw'~"\n", rule => <line>);
    ok invoice.parse('asdfawerwerw'~"\n", rule => <line>);

    ok invoice.subparse('fasdff;kjaf asdf asderwret'~"\n"~'Start Invoice Details'~"\n",rule => <invoice-prelude-end>);
    ok invoice.parse('fasdff;kjaf asdf asderwret'~"\n"~'Start Invoice Details'~"\n",rule => <invoice-prelude-end>);
    done-testing;
}

I have not been able to figure out why the parse with the rule <invoice-prelude> fails and returns Nil. Note that even .subparse fails.

The tests for the individual tokens pass, as you can see by running MAIN with the 'test' argument (except, of course, the .parse on <invoice-prelude-end>, which fails because it does not consume the full string).

What should be modified in the rule <invoice-prelude> so that the whole string $t in MAIN() can be parsed correctly?

Rental answered 11/1, 2019 at 11:38 Comment(2)
There seems to be a trailing space at the end of the line Start Invoice Details. This makes the lookahead regexp <?before 'Start Invoice Details'\n> fail, since it expects a newline immediately after that text. - Provost
For debugging tips, see https://mcmap.net/q/587953/-how-can-error-reporting-in-grammars-be-improved - Burmaburman

Note that there is a hidden space at the end of the last line in the $t string:

my $t = q :to/EOQ/; 
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf $1234.00
    qwe {rq} [we-r_q] we
    Start Invoice Details␣   <-- Space at the end of the line
    EOQ

This makes the <invoice-prelude-end> token fail, since it contains the lookahead regexp <?before 'Start Invoice Details'\n>. Because of the explicit newline character \n at its end, this lookahead does not allow for a space at the end of the line. Hence, the <invoice-prelude> rule cannot match either.

A quick fix is to remove the space at the end of the line Start Invoice Details.
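
If the input cannot be edited by hand, one alternative is to strip trailing horizontal whitespace before handing the text to the grammar. The following is only a minimal sketch: the sample string and the variable names $raw and $clean are made up for illustration and are not part of the original code.

```raku
# Hypothetical sample input; the second line ends with a stray space.
my $raw = "Invoice Summary\nStart Invoice Details \n";

# Remove horizontal whitespace that sits directly before each end of line.
my $clean = $raw.subst(/ \h+ $$ /, '', :g);

# Show the result with escapes visible.
say $clean.raku;
```

The cleaned string can then be passed to invoice.parse in place of $t.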

Dhow answered 11/1, 2019 at 12:52 Comment(2)
Håkon Hægland: Amazed at the fine catch! It works now! Just curious how you were able to detect this. I know this is one of the common mistakes one makes in regex-related programming. I looked at the code repeatedly and missed it. It helped that someone else looked at the code, and you did. - Rental
@Rental Thanks! I first used Grammar::Tracer to see if it could give me some indication of why the parse failed. This led me to the lookahead regex. It seemed that this was the point where the problem occurred, but Grammar::Tracer did not reveal exactly what the problem was. So I started to change the lookahead regex: first I removed the newline at the end of the lookahead, and I saw that the parsing now succeeded. After that it was easy to find the hidden space :) - Provost

First, without backtracking, the frugal quantifier *? will probably match the empty string every time. You can use regex instead of rule to allow backtracking.

Second, there is a space at the end of the line that starts with Start Invoice Details.

rule invoice-prelude-end {<line> <?before 'Start Invoice Details' \n>};

regex invoice-prelude {
    <invoice-prelude-start>
    <line>*?
    <invoice-prelude-end>
    <line>
}

If you want to avoid backtracking, you can use a negative lookahead.

token invoice-prelude-end { <line> };

rule invoice-prelude {
    <invoice-prelude-start>
    [<line> <!before 'Start Invoice Details' \n>]*
    <invoice-prelude-end>
    <line>
}

Whole example with some changes as inspiration:

use v6;
#use Grammar::Tracer;


grammar invoice {
    token ws { <!ww>\h* }
    token super-word { \S+ }
    token line { <super-word>* % <.ws> }

    token invoice-prelude-start   { 'Invoice Summary' }
    rule  invoice-prelude-midline { <line> <!before \n <invoice-details-start> \n> }
    token invoice-prelude-end     { <line> }
    token invoice-details-start   { 'Start Invoice Details' }

    rule invoice-prelude {
        <invoice-prelude-start> \n
        <invoice-prelude-midline> * %% \n
        <invoice-prelude-end> \n
        <invoice-details-start> \n
    }
}

multi sub MAIN(){

    my $t = q :to/EOQ/;
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf $1234.00
    qwe {rq} [we-r_q] we
    Start Invoice Details 
    EOQ


    say $t;
    say invoice.parse($t,:rule<invoice-prelude>);
}
Bestiality answered 11/1, 2019 at 17:57 Comment(1)
What is the need for <!ww> in <ws>? - Rental

TLDR: The issue is that the test input line with Start Invoice Details  ends with horizontal whitespace that you aren't dealing with.

There are two ways to deal with it (other than changing the input):

# Explicitly:                                                       vvv
token invoice-prelude-end { <line> <?before 'Start Invoice Details' \h* \n>}

# Implicitly:
rule  invoice-prelude-end { <line><?before 'Start Invoice Details' \n>}
# ^ must be a rule                      and there must be a space ^
# (uses the fact that you wrote your own <ws> token)
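
To see the effect of the explicit variant in isolation, the two lookaheads can be tried as plain regexes outside the grammar. This is only a sketch: the variable names and the two sample strings are assumptions made for illustration.

```raku
# Two hypothetical marker lines: one with a trailing space, one without.
my $with-space    = "Start Invoice Details \n";
my $without-space = "Start Invoice Details\n";

# The lookahead as written in the question, and the more tolerant variant.
my $strict   = / <?before 'Start Invoice Details' \n> /;
my $tolerant = / <?before 'Start Invoice Details' \h* \n> /;

say so $with-space    ~~ $strict;     # False: the space sits before the newline
say so $with-space    ~~ $tolerant;   # True:  \h* absorbs the space
say so $without-space ~~ $strict;     # True
say so $without-space ~~ $tolerant;   # True:  \h* may match nothing
```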

Following are some more things that I think would be helpful.

I would have used the “separated by” feature % in line and super-phrase

token super-phrase { <super-word>+ % \h } # single % doesn't capture trailing separator

token line {
  ^^ \h*
  <super-word>* %% \h+ # double %% can capture optional trailing separator
  \n
}

Those are [almost] exactly equivalent to what you wrote. (What you wrote has to fail to match <super-word> twice in <line>, but this only has to fail once.)
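
The difference between % and %% is easiest to see on small strings. The sketch below uses plain regexes rather than the grammar; \S+ stands in for <super-word>, and the names $single and $double as well as the sample strings are made up for illustration.

```raku
# Anchored so that the whole string has to match.
my $single = / ^ [\S+]+ %  \h+ $ /;   # separators only between items
my $double = / ^ [\S+]+ %% \h+ $ /;   # one trailing separator is also allowed

say so 'a b c'  ~~ $single;   # True
say so 'a b c ' ~~ $single;   # False: trailing space is not allowed
say so 'a b c'  ~~ $double;   # True
say so 'a b c ' ~~ $double;   # True:  trailing space is accepted
```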


I would have used the surround feature ~ in invoice-prelude

token invoice-prelude {
    # zero or more <line>s surrounded by <invoice-prelude-start> and <invoice-prelude-end>
    <invoice-prelude-start> ~ <invoice-prelude-end> <line>*?

    <line> # I assume this is here for debugging
}

Note that it didn't actually gain anything by being a rule because all of the horizontal whitespace is already handled by the rest of the code.


I don't think that the last line of the invoice prelude is special, so remove <line> from invoice-prelude-end. (<line>*? in invoice-prelude will capture it instead.)

token invoice-prelude-end {<?before 'Start Invoice Details' \h* \n>}

The only regexes that could benefit from being a rule are invoice-prelude-start and invoice-prelude-end.

rule  invoice-prelude-start {^^ Invoice Summary \n}
# `^^` is needed  so the space ^ will match <.ws>

rule  invoice-prelude-end {<?before ^^ Start Invoice Details $$>}

That would only work if you are fine with it matching something like      Invoice    Summary    ␤.

Note that invoice-prelude-start needs to use \n to capture it, but invoice-prelude-end can use $$ instead because it isn't capturing \n anyway.
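
As a small illustration of that difference, the sketch below compares how much text each form consumes. The sample string and variable names are assumptions, not part of the grammar.

```raku
# Hypothetical two-line input.
my $text = "abc\ndef\n";

my $with-newline = $text ~~ / abc \n /;   # \n consumes the newline
my $with-anchor  = $text ~~ / abc $$ /;   # $$ is zero-width and stops before it

say $with-newline.Str.chars;   # 4
say $with-anchor.Str.chars;    # 3
```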


If you change super-word to something other than \S+, then you may also want to change ws to something like \h+ | <.wb>. (word boundary)
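
For readers unfamiliar with the word-boundary assertions, here is a minimal sketch of how <wb> and its counterpart <ww> (within a word) behave in plain regexes. The sample strings and variable names are made up for illustration.

```raku
# <wb> succeeds between a word character and a non-word character.
my $boundary    = 'abc{' ~~ / abc <.wb> '{' /;

# Between two word characters there is no boundary, so <wb> fails there...
my $no-boundary = 'abcd' ~~ / abc <.wb> d /;

# ...while <ww> succeeds at exactly that position.
my $within-word = 'abcd' ~~ / abc <.ww> d /;

say so $boundary;      # True
say so $no-boundary;   # False
say so $within-word;   # True
```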


#! /usr/bin/env perl6
use v6.d;

grammar invoice {
    token TOP { # testing
         <invoice-prelude>
         <line>
    }

    token ws { \h* | <.wb> };
    token super-word { \S+ };
    token super-phrase { <super-word>+ % \h }
    token line {
        ^^ \h*
        <super-word>* %% \h+
        \n
    };

    rule invoice-prelude-start {^^ Invoice Summary \n}
    rule invoice-prelude-end {<?before ^^ Start Invoice Details $$>};

    token invoice-prelude {
        <invoice-prelude-start> ~ <invoice-prelude-end>
            <line>*?
    }
}

multi sub MAIN(){ 
    my $t = q :to/EOQ/; 
    Invoice Summary
    asd fasdf
    asdfasdf
    asd 123-fasdf $1234.00
    qwe {rq} [we-r_q] we
    Start Invoice Details 
    EOQ


    say $t;
    say invoice.parse($t);
}
Hyland answered 11/1, 2019 at 19:42 Comment(4)
Brad Gilbert: Thorough and elegant answer! I would have chosen this as the answer, but I had already accepted the first one that gave me the solution. I will most likely implement your solution. I am learning a lot from this answer. Thanks! - Rental
@Rental I don't really answer for reputation points anymore. I mostly do it to increase the level of knowledge of anyone who comes across it. - Hyland
Why did you define <ws> with the extra <.wb>? Will that help, given that I need <super-word> to be \S+? If I understand correctly, <wb> is a boundary around <alnum>. - Rental
@Rental Let's say that in 'abc  {' that abc and { should be matched separately. Now let's say we want abc{ to work exactly the same. That is a word boundary so <wb> will match. The built-in <ws> token acts a bit like <.wb> | \s+. Adding <.wb> was just a bit of future proofing. - Hyland
