Does the Marpa parser library support error recovery?
Asked Answered
P

1

6

I know Perl's "Marpa" Earley parser has very good error reporting.

But I can't find in its documentation or via Googling whether it has error recovery.

For instance, most C/C++ compilers have error recovery, which they use to report multiple syntax errors where often other compilers stop at the first error.

I'm actually parsing natural language and wonder if there's a way to re-sync and resume parsing after one part of the input fails.


Example, for those who can grok it:

I'm parsing syllables in the Lao language. In Lao some vowels are diacritics which are encoded as separate characters and rendered above the previous consonant. In parsing random articles from the Lao Wikipedia I ran into some text where such a vowel was doubled. This is not allowed in Lao orthography so must be a typo. But I know that within a couple of characters the text is good again.

Anyway this is the real example which piqued my general interest in error recovery or re-synchronizing with the token stream.

Pallor answered 6/9, 2014 at 7:55 Comment(4)
You might wish to give an example of an the kind of error and recovery you have in mind. Marpa allows many ways to do things, and this would make it easier to give answers specific to your concernts.Professionalism
@JeffreyKegler: To start somewhere simple, assume there is a single extraneous token and after skipping it the token stream once again conforms to the grammar. But if there's a general introduction to error recovery techniques online that would be best since even though I've played with parsers for years I never got to error recovery before. I used to parse programming languages and now I'm natural language.Pallor
That might allow a fairly simple solution using the error rule technique I described in my comment on amon's answer, something like: "vowel ::= good_vowel | bad_vowel bad_vowel ::= vowel vowel"Professionalism
After reading the other answers and comments I can say the situation I ran into was not something I would've expected so it shouldn't have a special case handler. Just trying to skip junk until parsing works again would be the right approach. But I am interested in the other approaches too because I'm interested in the whole field, apart from my current project (-:Pallor
T
7

There are two possibilities for handling mistakes in Marpa.

“Ruby Slippers” Parsing

Marpa maintains a lot of context during scanning. We can use this context so that the parser can require some token, and we can decide whether we want to offer it to Marpa even if it isn't in the input. Consider for example a programming language that requires any statement to be terminated by a semicolon. We can then use Ruby Slippers techniques to insert semicolons at specific locations, such as at the end of a line, or before a closing brace:

use strict;
use warnings;
use Marpa::R2;
use Data::Dump 'dd';

my $grammar = Marpa::R2::Scanless::G->new({
    source => \q{
        :discard ~ ws
        Block         ::= Statement+ action => ::array
        Statement     ::= StatementBody (STATEMENT_TERMINATOR) action => ::first
        StatementBody ::= 'statement'       action => ::first
                      |   ('{') Block ('}') action => ::first
        STATEMENT_TERMINATOR ~ ';'
        event ruby_slippers = predicted STATEMENT_TERMINATOR
        ws ~ [\s]+
    },
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

my $input = q(
    statement;
    { statement }
    statement
    statement
);

for (
  $recce->read(\$input);
  $recce->pos < length $input;
  $recce->resume
) {
    ruby_slippers($recce, \$input);
}
ruby_slippers($recce, \$input);
dd $recce->value;

sub ruby_slippers {
    my ($recce, $input) = @_;
    my %possible_tokens_by_length;

    my @expected = @{ $recce->terminals_expected };
    for my $token (@expected) {
        pos($$input) = $recce->pos;

        if ($token eq 'STATEMENT_TERMINATOR') {
            # fudge a terminator at the end of a line, or before a closing brace
            if ($$input =~ /\G \s*? (?: $ | [}] )/smxgc) {
                push @{ $possible_tokens_by_length{0} }, [STATEMENT_TERMINATOR => ';'];
            }
        }
    }

    my $max_length = 0;
    for (keys %possible_tokens_by_length) {
        $max_length = $_ if $_ > $max_length;
    }
    if (my $longest_tokens = $possible_tokens_by_length{$max_length}) {
        for my $lexeme (@$longest_tokens) {
            $recce->lexeme_alternative(@$lexeme);
        }
        $recce->lexeme_complete($recce->pos, $max_length);

        return ruby_slippers($recce, $input);
    }
}

In the ruby_slippers function, you could also count how often you needed to fudge a token. If that count exceeds some value, you could abandon the parse by throwing an error.

Skipping input

If your input may contain unparseable junk, you can try skipping that if no lexeme would otherwise be found. For this, the $recce->resume method takes an optional position argument, where the normal parsing will resume.

use strict;
use warnings;
use Marpa::R2;
use Data::Dump 'dd';
use Try::Tiny;

my $grammar = Marpa::R2::Scanless::G->new({
    source => \q{
        :discard ~ ws
        Sentence ::= WORD+ action => ::array
        WORD ~ 'foo':i | 'bar':i | 'baz':i | 'qux':i
        ws ~ [\s]+
    },
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

my $input = '1) Foo bar: baz and qux, therefore qux (foo!) implies bar.';

try { $recce->read(\$input) };
while ($recce->pos < length $input) {
    # ruby_slippers($recce, \$input);
    try   {       $recce->resume                    }  # restart at current position
    catch { try { $recce->resume($recce->pos + 1) } }; # advance the position
    # if both fail, we go into a new iteration of the loop.
}
dd $recce->value;

While the same effect could be achieved with a :discard lexeme that matches anything, doing the skipping in our client code allows us to abort the parse if too much fudging had to be done.

Todd answered 6/9, 2014 at 12:13 Comment(1)
amon++. Another techique is error rules -- rules like "bad_widget ::= widget_prefix gadget widget_suffix", to recognize cases where an (illegal) attempt is made to include a gadget inside a widget. Or "bad_if ::= if '(' assignment ')' block", if you want to catch a top-level assignment in an if statement's conditional. Traditional parsers did not use error rules, because they barely had enough power to parse correct inputs. Marpa has power to spare.Professionalism

© 2022 - 2024 — McMap. All rights reserved.