Anti-matching against an infinite family of <!before> patterns in Raku

I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.

Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.

say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
        | $ 
        | '_' <?before $> 
        | '_' <?before ['_' <?before $>]>
        | '_' <?before ['_' <?before ['_' <?before $>]>]>
        # ...
    ]>
]+
/;

Is there a way to construct the rest of the pattern implied by the ...?

Pandit answered 22/11, 2020 at 17:25 Comment(1)
Probably the easiest way to avoid matching whitespace at the end of a string, while matching it everywhere else, is to go about it a very different way: \s+ <!before $> That matches all stretches of whitespace, so long as they don't come right before the end of the string. - Filmdom

"avoid matching whitespace at the end of a string while still matching whitespace in the middle of words"

Per Brad's answer, and your comment on it, something like this:

/ \w+ % \s+ /

"what I'm looking for is a way to match arbitrarily long streams that end with a known pattern"

Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:

/ \w+ % \s+ )> \s* $ /

The )> capture marker marks where capture is to end.

You can use arbitrary patterns on the left and right of that marker.
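
As a minimal sketch of the marker in action (the sample string and variable names are made up for illustration): note that in `\w+ % \s+` the `%` separator applies to the quantified atom `\w`, so it separates single word characters. For words longer than one character, this sketch assumes you want the bracketed form `[\w+]+ % \s+` instead.

```raku
my $str = 'foo bar  baz   ';

# [\w+]+ % \s+ : one or more words, separated by runs of whitespace.
# )>           : end of the reported match; the rest of the pattern must
#                still match, but is not included in the result.
my $m = $str ~~ / [\w+]+ % \s+ )> \s* $ /;

say $m.Str;   # foo bar  baz
```

The trailing `\s* $` still has to match for the regex to succeed; it is just left out of what `$m` reports.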

"an infinite family of <!before> patterns"

Generalizing to an infinite family of patterns of any type, zero-width or not, the most natural solution in a regex is iteration using any of the standard open-ended quantifiers, for example \s+ for one or more whitespace characters.[1] [2]
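
Applied to the example in the question, one way to sketch this (an illustration, not necessarily what the asker ended up using) is to fold the whole family of nested lookaheads into a single lookahead containing an open-ended quantifier:

```raku
my $str = 'x_x___x________';

# '_' <!before '_'* $ > : an underscore that is NOT followed by
# "any number of further underscores and then the end of the string".
my $m = $str ~~ / [ 'x' | '_' <!before '_'* $ > ]+ /;

say $m.Str;   # x_x___x
```

The `'_'*` sits inside a zero-width lookahead, and every iteration of the outer `+` consumes exactly one character, so the caution in footnote 1 about infinite loops is satisfied.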

"Is there a way to construct the rest of the pattern implied by the ...?"

I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"

The answer is always "Yes":

  • While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.

  • Rules have arbitrary read access to capture state.[3]

  • Rules can do arbitrary Turing-complete computation.[4]

  • A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.

In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.

Footnotes

[1] If you use an open-ended quantifier, make sure that each match iteration (or recursion) either consumes at least one character or fails, so that you avoid an infinite loop. For example, the * quantifier succeeds even if the pattern it quantifies matches nothing, so check that this cannot lead to an infinite loop.

[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]

[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:

  • $¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.

  • $/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.

For example:

'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123

  • The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.

  • The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.

  • The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.

[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:

  • Rules can be invoked recursively;

  • Rules can have full signatures and pass arguments;

  • Rules can contain arbitrary code;

  • Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
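
To make the recursion point concrete, here is a small sketch of the question's nested lookaheads written as a recursive rule. The grammar name `NoTrailing` and the rule name `trailing` are invented for this illustration, and the sequential alternation `||` is chosen here simply to keep the recursive rule easy to follow.

```raku
grammar NoTrailing {
    # Body: x's and underscores, but never an underscore that belongs
    # to a run of underscores reaching the end of the string.
    token TOP { [ 'x' | '_' <!before <.trailing> > ]+ }

    # Recursive rule: either we are at the end of the string, or we
    # consume one underscore and ask the same question again.
    regex trailing { $ || '_' <.trailing> }
}

# subparse anchors at the start of the string but not at the end.
my $m = NoTrailing.subparse('x_x___x________');
say $m.Str;   # x_x___x
```

Each recursive call to `trailing` either stops at the end of the string or consumes one underscore first, so the recursion always terminates.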

Heliotropism answered 23/11, 2020 at 16:19 Comment(0)

It is a little difficult to discern what you are asking for.


You could be looking for something as simple as this:

say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」

or

say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」

or

say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
Its answered 22/11, 2020 at 18:49 Comment(2)
I see @brad has (?) answered your actual question. Here is the link to "Modified quantifier" (for separators) in the docs: docs.raku.org/language/regexes#Modified_quantifier:_%,_%% - Keramics
That's essentially the solution I settled on: / ['x'+]+ % '_'+ /. However, I think ultimately what I'm looking for is a way to match arbitrarily long streams that end with a known pattern. I will try to clarify the question a bit. - Pandit

I would suggest using a capture, thusly:

'x_x___x________' ~~ /(.*?) _* $/; 
say $0;     #「x_x___x」

(The ? modifier makes the * 'non-greedy'.) Please let me know if I have missed the point!

Keramics answered 22/11, 2020 at 18:48 Comment(1)
or, more compactly, say ('x_x___x________' ~~ /(.*?) _* $/)[0]; - Keramics

I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:

> say 'x x  x   ' ~~ m:g/ 'x'+  \s* /;    
(「x 」 「x  」 「x   」)    

> say 'x x  x   '.trim-trailing ~~ m:g/ 'x'+  \s* /;    
(「x 」 「x  」 「x」)

HTH.

https://docs.raku.org/routine/trim
https://docs.raku.org/routine/trim-trailing
https://docs.raku.org/routine/trim-leading

Kemp answered 24/11, 2020 at 20:29 Comment(0)
