The .Net alternative to recursive pattern is a stack. The challenge here is that we need to express the grammar it terms of stacks.
Here's one way of doing that:
A slightly different notation for grammars
- First, we need grammar rules (like
A
and Q
in the question).
- We have one stack. The stack can only contain rules.
- At each step we pop the current state from the stack, match what we need to match, and push further rules into the stack. When we're done with a state we don't push anything and get back to the previous state.
This notation is somewhere between CFG and Pushdown automaton, where we push rules to the stack.
Example:
Let's start with a simple example: anbn. The usual grammar for this language is:
S -> aSb | ε
We can rephrase that to fit the notation:
# Start with <push S>
<pop S> -> "a" <push B> <push S> | ε
<pop B> -> "b"
In words:
- We start with
S
in the stack.
- When we pop
S
from the stack we can either:
- Match nothing, or...
- match "a", but then we have to push the state
B
to the stack. This is a promise we will match "b". Next we also push S
so we can keep matching "a"s if we want to.
- When we've matched enough "a"s we start popping
B
s from the stack, and match a "b" for each one.
- When this is done, we've matched an even number of "a"'s and "b"s.
or, more loosely:
When we're in case S, match "a" and push B and then S, or match nothing.
When we're in case B, match "b".
In all cases, we pop the current state from the stack.
Writing the pattern in a .Net regular expression
We need to represent the different states somehow. We can't pick '1' '2' '3' or 'a' 'b' 'c', because those may not be available in our input string - we can only match what is present in the string.
One option is to number our states (In the example above, S
would state number 0, and B
is state 1).
For state S𝒊 we can push 𝒊 characters to the stack. For convenience, we'll push the first 𝒊 characters from the start of the string. Again, we don't care what are these characters, just how many there are.
Push state
In .Net, if we want to push the first 5 characters of a string to a stack, we can write:
(?<=(?=(?<StateId>.{5}))\A.*)
This is a little convoluted:
(?<=…\A.*)
is a lookbehind that goes all the way to the start of the string.
- When we're at the start there's a look ahead:
(?=…)
. This is done so we can match beyond the current position - if we are at position 2, we don't have 5 characters before us. So we're looking back and looking forward.
(?<StateId>.{5})
push 5 characters to the stack, indicating that at some point we need to match state 5.
Pop state
According to our notation, we always pop the top state from the stack. That is easy: (?<-StateId>)
.
But before we do that, we want to know what state that was - or how many characters it captured. More specifically, we need to check explicitly for each state, like a switch
/case
block.
So, to check if the stack currently holds state 5:
(?<=(?=.{5}(?<=\A\k<StateId>))\A.*)
- Again,
(?<=…\A.*)
goes all the way to the start.
- Now we advance
(?=.{5}…)
by five characters...
- And use another lookbehind,
(?<=\A\k<StateId>)
to check that the stack really has 5 characters.
This has an obvious drawback - when the string is too short, we cannot represent the number of large states. This problem has solutions:
- The number of short words in the language is final anyway, so we can add them manually.
- Use something more complicated than a single stack - we can use several stacks, each with zero or one characters, effectively bits of our state (there's an example at the end).
Result
Our pattern for anbn looks like this:
\A
# Push State A, Index = 0
(?<StateId>)
(?:
(?:
(?:
# When In State A, Index = 0
(?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
(?:
# Push State B, Index = 1
(?<=(?=(?<StateId>.{1}))\A.*)
a
# Push State A, Index = 0
(?<StateId>)
|
)
)
|
(?:
# When In State B, Index = 1
(?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
b
)
|\Z
){2}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))
Working example on Regex Storm
Notes:
- Note that the quantifier around the states is
(?:(?:…){2})+
- that is, (state count)×(input length). I also added an alternation for \Z
. Let's not get into that, but it's a workaround for an annoying optimization in the .Net engine.
- The same can be written as
(?<A>a)+(?<-A>b)+(?(A)(?!))
- this is just an exercise.
To answer the question
The grammar from the question can be rewritten as:
# Start with <push A>
<pop A> -> <push A> ( @"," | <push Q> ) | ε
<pop Q> -> \w
| "<" <push Q2Close> <push A>
| "[" <push Q1Close> <push QStar> <push Q1Comma> <push QStar> <push Q1Semicolon> <push A>
<pop Q2Close> -> ">"
<pop QStar> -> <push QStar> <push Q> | ε
<pop Q1Comma> -> ","?
<pop Q1Semicolon> -> ";"
<pop Q1Close> -> "]"
The pattern:
\A
# Push State A, Index = 0
(?<StateId>)
(?:
(?:
(?:
# When In State A, Index = 0
(?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
(?:
# Push State A, Index = 0
(?<StateId>)
(?:
,
|
# Push State Q, Index = 1
(?<=(?=(?<StateId>.{1}))\A.*)
)
|
)
)
|
(?:
# When In State Q, Index = 1
(?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
(?:
\w
|
<
# Push State Q2Close, Index = 2
(?<=(?=(?<StateId>.{2}))\A.*)
# Push State A, Index = 0
(?<StateId>)
|
\[
# Push State Q1Close, Index = 6
(?<=(?=(?<StateId>.{6}))\A.*)
# Push State QStar, Index = 3
(?<=(?=(?<StateId>.{3}))\A.*)
# Push State Q1Comma, Index = 4
(?<=(?=(?<StateId>.{4}))\A.*)
# Push State QStar, Index = 3
(?<=(?=(?<StateId>.{3}))\A.*)
# Push State Q1Semicolon, Index = 5
(?<=(?=(?<StateId>.{5}))\A.*)
# Push State A, Index = 0
(?<StateId>)
)
)
|
(?:
# When In State Q2Close, Index = 2
(?<=(?=.{2}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
>
)
|
(?:
# When In State QStar, Index = 3
(?<=(?=.{3}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
(?:
# Push State QStar, Index = 3
(?<=(?=(?<StateId>.{3}))\A.*)
# Push State Q, Index = 1
(?<=(?=(?<StateId>.{1}))\A.*)
|
)
)
|
(?:
# When In State Q1Comma, Index = 4
(?<=(?=.{4}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
,?
)
|
(?:
# When In State Q1Semicolon, Index = 5
(?<=(?=.{5}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
;
)
|
(?:
# When In State Q1Close, Index = 6
(?<=(?=.{6}(?<=\A\k<StateId>))\A.*)
(?<-StateId>)
\]
)
|\Z
){7}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))
Sadly, it is too long to fit in a url, so no online example.
If we use "binary" stacks with one or zero characters, it would have looked like this: https://gist.github.com/kobi/8012361
Here's a screenshot of the pattern passing all tests: https://i.stack.imgur.com/IW2xr.png
Bonus
The .Net engine can do more than just to balance - it can also capture each instance of A
or Q
. This requires a slight modification of the pattern: https://gist.github.com/kobi/8156968 .
Note that we've added the groups Start
, A
and Q
to the pattern, but they have no effect of the flow, they are used purely for capturing.
The result: for example, for the string "[<a>b;<c,d>,<e,f>]"
, we can get these Capture
s:
Group A
(0-17) [<a>b;<c,d>,<e,f>]
(1-4) <a>b
(2-2) a
(7-9) c,d
(13-15) e,f
Group Q
(0-17) [<a>b;<c,d>,<e,f>]
(1-3) <a>
(2-2) a
(4-4) b
(6-10) <c,d>
(7-7) c
(9-9) d
(12-16) <e,f>
(13-13) e
(15-15) f
Open questions
- Can every grammar be converted to the state-stack notation?
- Is (state count)×(input length) enough steps to match all words? What other formula can work?
Source code
The code used to generate these patterns and all test cases can be found on https://github.com/kobi/RecreationalRegex