How do I access the captures within a match?

I am trying to parse a CSV file, and I am trying to access a named regex within a proto regex in Perl 6, but it turns out to be Nil. What is the proper way to do it?

grammar rsCSV {
    regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
    proto regex oneCSV {*}
          regex oneCSV:sym<noQuote> { <-[\"]>*?  }
          regex oneCSV:sym<quoted>  { \" .*? \" } # use non-greedy match
}

my $input = prompt("Enter csv line: "); 

my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>;  # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];
Tyndale answered 24/11, 2016 at 7:6 Comment(0)

Detailed discussion complementing Christoph's answer

I am trying to parse a csv file

Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].

I am trying to access a named regex

If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].

in proto regex in Raku

Your issue is unrelated to it being a proto regex.

Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.

Where do match objects corresponding to captures appear?

To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.

given 'ab'
{
    my $m1 = m/ $<named-capture> = ( ab ) /;

    say $m1<named-capture>;
    # 「ab」
}

The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.

But you were getting Nil with $m1<oneCSV>. What gives?

Why your $m1<oneCSV> did not work

There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:

given 'ab'
{
    my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added

    say $m1[0]<named-capture>;
    # 「ab」
}

The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
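
As a minimal sketch of that numbering (the string abc and the variable $m are purely illustrative, not taken from your code), three sibling pairs of parens give three top level numbered captures:

```raku
# Three sibling pairs of parens, so three top level numbered captures.
my $m = 'abc' ~~ / (a) (b) (c) /;

say $m[0];            # match object for the first pair of parens
say $m[1];            # match object for the second pair
say $m[2];            # match object for the third pair
say $m.list.elems;    # number of top level numbered captures
```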

The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.

So far so good. But this is only half the story...

Why $m1[0]<oneCSV> in your code would not work either

While $m1[0]<named-capture> works in the code immediately above, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you quantified the zeroth capture with *, which asks for multiple matches:

given 'ab'
{
    my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier

    say $m1[0][0]<named-capture>;
    # 「ab」
}

Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
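
To see the list grow, here is a small sketch that runs the same regex against a longer, made-up input (ababab is my own example string, not something from your code):

```raku
# Same regex as above, matched against a string with three repetitions of "ab".
my $m1 = 'ababab' ~~ / ( $<named-capture> = ( ab ) )* /;

say $m1[0].elems;                # how many times the quantified capture matched
say $m1[0][0]<named-capture>;    # named capture inside the first repetition
say $m1[0][2]<named-capture>;    # named capture inside the third repetition
```

Each repetition gets its own match object in the list, and each of those carries its own named capture.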

Summary

  • Captures nest;

  • A capture quantified by either * or + corresponds to two levels of nesting, not just one.

  • In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
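
The points above can be sketched with a plain regex that stands in for your grammar. This is an assumption-laden simplification: the capture name field, the input ab,cd, and the "run of non-comma characters" rule are all hypothetical and much cruder than your oneCSV rules.

```raku
# Hypothetical stand-in for the grammar in the question: each field is a run
# of non-comma characters followed by a comma, wrapped in quantified parens.
my $m = 'ab,cd,' ~~ / ( $<field> = ( <-[,]>+ ) ',' )* /;

say $m<field>;          # Nil, because no top level capture is named "field"
say $m[0][0]<field>;    # named capture inside the first repetition
say $m[0][1]<field>;    # named capture inside the second repetition
```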


[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)

[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules.)

Ledbetter answered 25/11, 2016 at 6:34 Comment(3)
Thank you so much, raiph !!! I see my problems now after your detailed explanation. Thank you so much for your time !!!Tyndale
@Tyndale You're welcome. If you can tell me which particular bit(s) was/were most helpful to you that would be particularly helpful to me. :)Ledbetter
Thanks raiph. It is your explanation of the association between named/numbered capture and match object tree and the fact that, with *, perl6 constructs a list instead of a single object. Thanks again!Tyndale

The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].

As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.

The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.

If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.
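
A rough sketch of the map variant, using a plain regex with a made-up capture name (field) and input rather than the grammar from the question:

```raku
# Hypothetical regex and input; "field" is an illustrative capture name.
my $m = 'ab,cd,' ~~ / ( $<field> = ( <-[,]>+ ) ',' )* /;

# Walk the list of repetitions and pull the named capture out of each one,
# stringifying it along the way.
my @fields = $m[0].map({ $_<field>.Str });
say @fields;
```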

Erving answered 24/11, 2016 at 10:7 Comment(1)
Thank you Christoph. My understanding of Perl 6 has been furthered by your answers !Tyndale
