Capturing what's inside a nested structure in a regex or grammar token

I'd like to capture the interior of a nested structure.

my $str = "(a)";
say $str ~~ /"(" ~ ")" (\w) /;
say $str ~~ /"(" ~ ")" <(\w)> /;
say $str ~~ /"(" <(~)> ")" \w /;
say $str ~~ /"(" <(~ ")" \w /;

The first one works; the last one works but also captures the closing parenthesis. The other two fail, so it's not possible to use capture markers in this case. But the problem is more complicated in the context of a grammar, since capturing groups do not seem to work either, like here:

# Please paste this together with the code above so that it compiles.
grammar G {
    token TOP {
              '(' ~ ')' $<content> = .+?
    }
}

grammar H {
    token TOP {
              '(' ~ ')' (.+?)
    }
}

grammar I {
    token TOP {
              '(' ~ ')' <( .+? )>
    }
}

$str = "(one of us)";
for G,H,I -> $grammar {
    say $grammar.parse( $str );
}

Neither capturing groups nor capture markers seem to work; the only thing that does is assigning the match, on the fly, to a variable. This, however, creates an additional token I'd really like to avoid. So there are two questions:

  • What is the right way to make capture markers work in nested structures?
  • Is there a way to use either capturing groups or capturing markers in tokens to get the interior of a nested structure?
Jari answered 4/7, 2020 at 11:59 Comment(2)
What is dire about this situation... you don't know the problem so that seems like a bit much? Also, have you tried running the code posted to see if it compiles (it doesn't)? Regardless, your grammar I should be '(' ~ ')' [<( .+? )>] – Diligent
It's split; if you post them together, it does. Added for clarification. Also why does one need to use a non-capture grouper here? – Jari

One solution to two issues

  • Per ugexe's comment, the [...] grouping construct works for all your use cases.

  • The <( and )> capture markers are not grouping constructs so they don't work with the regex ~ operation unless they're grouped.

  • The (...) capture/grouping construct clamps frugal matching to its minimum match when ratchet is in effect. A pattern like :r (.+?) never matches more than one character.

The behaviors described in the last two bullet points above aren't obvious, aren't in the docs, may not be per the design docs, may be holes in roast, may be figments of my imagination, etc. The rest of this answer explains what I've found out about the above three cases, and discusses some things that could be done.

Glib explanation, as if it's all perfectly cromulent

<( and )> are capture markers.

They behave as zero width assertions. Each asserts "this marks where I want capturing to start/end for the regex that contains this marker".
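
A minimal sketch of that behavior, assuming a sample string and a variable name chosen here purely for illustration. The markers narrow the overall match by adjusting its .from and .to, and do not create a positional capture:

```raku
# Hypothetical sample: the markers narrow the match to the interior.
my $m = '(a)' ~~ / '(' <( \w )> ')' /;

say $m;               # 「a」
say $m.from;          # 1
say $m.to;            # 2
say $m.list.elems;    # 0 (no positional capture was created)
```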


Per the doc for the regex ~ operator:

it mostly ignores the left argument, and operates on the next two [arguments]

(The doc says "atoms" where I've written "arguments". In reality it operates on the next two atoms or groups.)

In the regex pattern "(" ~ ")" <(\w)>:

  • ")" is the first atom/group after ~.

  • <( is the second atom/group after ~.

  • ~ ignores \w)>.
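
A small sketch of that reading, using the string from the question. The variable names are illustrative, and the explicit pattern is only my assumption about how ~ rearranges its arguments (closer first, then interior):

```raku
my $str = '(a)';

# With a plain atom as the interior, `~` behaves like '(' \w ')'.
my $tilde    = $str ~~ / '(' ~ ')' \w /;
my $explicit = $str ~~ / '(' \w ')' /;
say $tilde;       # 「(a)」
say $explicit;    # 「(a)」

# Ungrouped markers: only `<(` is taken as the interior, so `\w )>`
# is left to match after the closing paren, and the match fails.
my $ungrouped = $str ~~ / '(' ~ ')' <( \w )> /;
say so $ungrouped;    # False
```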


The solution is to use [...]:

say '(a)' ~~ / '(' ~ ')' [ <( \w )> ] /; # 「a」

Similarly, in a grammar:

token TOP { '(' ~ ')' [ <( .+? )> ] }
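
For completeness, a self-contained sketch of the grammar from the question with that one change applied. The grammar name and input string are the question's; the variable name is illustrative. If the grouping behaves as described above, the parse result should be just the interior, with no extra captures:

```raku
# Grammar I from the question, with the markers wrapped in [...]
# so that `~` sees them as a single group.
grammar I {
    token TOP { '(' ~ ')' [ <( .+? )> ] }
}

my $match = I.parse('(one of us)');
say $match;               # the overall match, narrowed by the markers
say $match.caps.elems;    # number of sub-captures created
```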

(...) grouping isn't what you want for two reasons:

  • It couldn't be what you want. It would create an additional token capture. And you wrote you'd like to avoid that.

  • Even if you wanted the additional capture, using (...) when ratchet is in effect clamps frugal matching within the parens.
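
A hedged sketch for comparing the three cases side by side. The variable names are illustrative, and the ratchet results are deliberately left without expected output, since the clamping described above is an observation rather than documented behavior:

```raku
my $str = '(one of us)';

# Without ratchet, the frugal quantifier can grow until ')' matches.
my $backtracking = $str ~~ / '(' (.+?) ')' /;
say ~$backtracking[0];    # one of us

# Same pattern with ratchet on. Per the description above, the
# capture group is expected to stay at one character, so this
# is expected not to match.
my $ratcheted = $str ~~ / :r '(' (.+?) ')' /;
say $ratcheted.defined ?? "matched: $ratcheted" !! 'no match';

# Non-capturing group under ratchet, for comparison.
my $grouped = $str ~~ / :r '(' [.+?] ')' /;
say $grouped.defined ?? "matched: $grouped" !! 'no match';
```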

What could be done about capture markers "not working"?

I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.

Is it known to be intended behavior or a bug?

Searches of GH repos for "capture markers":

The term "capture markers" comes from the doc, not the old design docs which just say:

A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. When matched, these behave as assertions that are always true, but have the side effect of setting the .from and .to attributes of the match object.

(Maybe you can figure out from that what strings to search for among issues etc...)

At the time of writing, all GH searches for <( or )> draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos, eg this.


I was curious and tried this:

my $str = "aaa";
say $str ~~ / <(...)>* /;

It infinitely loops. The * is acting on just the )>. This corroborates the sense that capture markers are treated as atoms.


The regex ~ operator works for [...] and some other grouped atom constructions. Parsing any of them has a start and end within a regex pattern.

The capture markers are different in that they aren't necessarily paired -- the start or end can be implicit.
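
A short illustration of unpaired markers, with sample strings and variable names chosen here for the example. When only one marker is present, the other end of the capture is the ordinary start or end of the match:

```raku
# Only a start marker: the capture runs to the end of the match.
my $start-only = 'abc' ~~ / a <( b c /;
say $start-only;    # 「bc」

# Only an end marker: the capture starts where the match starts.
my $end-only = 'abc' ~~ / a b )> c /;
say $end-only;      # 「ab」
```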

Perhaps this makes treating them as we might wish unreasonably difficult for Raku given that start (/ or {) and end (/ or }) occur at a slang boundary and Raku is a single-pass parsing braid?


I think that a doc fix is probably the appropriate response to this capture marker aspect of your SO.

If regex ~ were the only regex construct that cared that left and right capture markers are each an individual atom then perhaps the best place to mention this wrinkle would be in the regex ~ section.

But given that multiple regex constructs care (quantifiers do per the above infinite loop example), then perhaps the best place is the capture markers section.

Or perhaps it would be best if it's mentioned in both. (Though that's a slippery slope...)

What could be done about :r (.*?) "not working"?

I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.

Is it known to be intended behavior or a bug?

Searches of GH repos for ratchet frugal:

The terms "ratchet" and "frugal" both come from the old design docs and are still used in the latest doc and don't seem to have aliases. So searches for them should hopefully match all relevant mentions.

The above searches are for both words. Searching for one at a time may reveal important relevant mentions that happen to not mention the other.

At the time of writing, all GH searches for .*? or similar draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos.


Perhaps the issue here is broader than the combination of ratchet, frugal, and capture?

Perhaps file an issue using the words "ratchet", "frugal" and "capture"?

Capitoline answered 4/7, 2020 at 19:14 Comment(7)
Made an edit to clarify. Still not clear why H does not work. Atomization stops somewhere the starting parenthesis, and grouping does not really work. – Jari
Answering this SO suggested a simple technique to me to capture just the last match of multiple matches. Perhaps it's obvious, but in case not, I'm writing it here in this comment to add it to our collective SO knowledge: my $str = "aaa"; say $str ~~ / [<(\w)>]* /; # 「a」. – Capitoline
@Jari You accepted before you published your comment. And my answer at that time accidentally hid the problem with H (I hadn't understood that there was actually what looks like a bug in Rakudo; is it a bug? are you aware of a corresponding filed issue?); and did a poor job of addressing your other points. Most of my work on this answer has come after you accepted it. To be clear, acceptance isn't what I care about. My main aim is writing useful answers and comments for askers, readers, rakuns, and commenters (with the occasional poem, question, or whatever thrown in for good measure). – Capitoline
You did a great job. I just didn't think you could make it better when I accepted it, yet you did, so kudos. – Jari
@Jari What's your take on clamping? – Capitoline
Sorry, what do you mean here? – Jari
I mean the parts of my answer that discuss clamping. For example "The (...) capture/grouping construct clamps...". More importantly, what's your take on the possible reactions to that covered in the section What could be done about :r (.*?) "not working"?. At minimum, is it a bug (I suspect so) and are you aware of a corresponding filed issue? – Capitoline
