avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )>
capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
an infinite family of <!before>
patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+
for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...
?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary turing complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the *
quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢
is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/
is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... /
is analogous to an ordinary function call. The first 1
and first 123
of the output show what has been captured by that overall regex.
The ( ... )
sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" }
within it is analogous to a block of code. The 2
shows what it has captured.
The final 123
shows that, at the top level of the regex, $/
and $¢
have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... }
blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
\s+ <!before $>
That matches all stretches of whitespace, so long as it doesn't come right before the end of a string – Filmdom