I wrote the original implementation of this code. It has changed somewhat since then, which has slightly changed the time/space tradeoffs for how it works.
First, I need to call out one misunderstanding in your post, the rules for Perls regex engine are not to match the longest string, but rather to match the leftmost-longest string, in particular
"foal"=~/(f|fo|foa|foal)/
should match "f" not "foal" as the "f" option is first. This is called the "word order" in the comments in the code. You can see this with Perl by injecting (?{ print $& }) into your patterns:
perl -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/'
f
foa
fo
Another point I thought I should explain is that the code and comments use the term "state" and "accepting state" a lot. This comes from the observation that a Trie is equivalent to an acyclic DFA, and can be naively represented as a table of one row per state, with one column per legal character in the input alphabet and with an additional column being used to track whether the state is "accepting", meaning "ends one of the alternatives", and if so which branch of the alternation it terminates. The values in the other columns represent the new state to transition to after reading the columns character. An accepting state may have transitions to more states, as in the case where one branch matches a prefix of another.
Matching then comes down to walking the tree/state table, when we encounter an accepting state we push a 2-tuple of the number of characters we have read so far in the trie and the word number associated with the state into a list. Eventually we either run out of string to read, or encounter a transition to the 0 state which means we can stop as no further matches are possible.
Once you have a list of the tuples of (length,word) you can then go through them in order of word ascending, since such lists are generally short IMO it is reasonable to simply use linear probing to find the minimum word each try. Perl uses a strategy where it precompiles all such possible lists into one structure in advance, so it need only keep track of the branch number of the most recent accepting state, and then can (somewhat expensively) compute any preceeding accepting states from that. For instance the example below /(f|foa|fo)al/ produces a structure like this:
word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)
Which shows that if we reach the accepting state for "foa" which is branch 2 we have matched a 3 character string and there is a preceding accepting state of branch 3, which in turn points at 1. From this we can reconstruct the behavior we would expect to see, which is to try "al" after "f", "foa" and then "fo". The process is somewhat expensive (N**2) compared to other solutions, but has the advantage of being precompiled and reusable and the storage requirements fit nicely with how the regex backtracking engine works. IMO it doesnt matter, the main issue is that somehow you track the accepting states where you were in the string when you encountered them, and then try the "tail" in the right order.
You can see much of this at work with extended debugging:
perl -Mre=Debug,TRIE,TRIEE,TRIEM -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/'
Compiling REx "(?:f|foa|fo)(?{print $&})al"
Looking for TRIE'able sequences. Tail node is: EVAL
- BRANCH (1) -> EXACT <f> => EVAL (First==-1,Last==-1,Cur==1,tt==END,nt==EXACT,nnt==END)
- BRANCH (4) -> EXACT <foa> => EVAL (First==1,Last==-1,Cur==4,tt==EXACT,nt==EXACT,nnt==END)
- BRANCH (7) -> EXACT <fo> => EVAL (First==1,Last==4,Cur==7,tt==EXACT,nt==EXACT,nnt==END)
- TAIL (10) <SCAN FINISHED>
make_trie start==1, first==1, last==10, tail==11 depth=1
TRIE(NATIVE): W:3 C:6 Uq:3 Min:1 Max:3
Compiling trie using table compiler
Char : f o a
State+-------------
1 : 2 . . ( 1)
2 : . 3 . ( 1) W 1
3 : . . 4 ( 1) W 3
4 : . . . ( 0) W 2
Here you can see the DFA/TRIE equivalency and the word data, and how the table is constructed. Because there are only 3 legal characters we build a table that has 4 columns, one per each legal character, and one to track if the state is accepting and if so which branch of the alternation it accepts. The W 2
on row 4 shows that it is accepting for the 2nd branch of the alternation.
Alloc: 22 Orig: 13 elements, Final:3. Savings of %76.92
Statecount:5 Lasttrans:4
Char : Match Base Ofs f o a
State|-----------------------------------
# 1| @ 3 + 0[ 2 . .]
# 2| W 1 @ 3 + 1[ . 3 .]
# 3| W 3 @ 3 + 2[ . . 4]
# 4| W 2 @ 0
This shows the naive table representation being packed into a compressed form that allows the zero transition in one row to be used to store non-zero transitions in other rows. Each transition is stored in an array of 2-tuple of [owner-state,transition], with two side arrays, one which stores the start position of a given state, and another that stores the accepting mappings for a given state, state transitions involve checking if table[start_offset[state]+column_ofs].owner_state is the same as state, and if not then table[start_offset[state]+column_ofs].new_state can be assumed to be 0. See the Red Dragon for more details.
word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)
this part of the output shows how we have precomputed the length, and the predecessor branch number for every accepting state. This way we don't have to construct a list of accepting states as we match, but instead can precompute all the possible such lists at compile time into one static structure.
Final program:
1: EXACT <f> (3)
3: TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o] (11)
<>
<oa>
<o>
Here you can see an interesting optimization kick in, because all of the alternations start with the letter "f" we "remove" it from the trie, and turn it into a prefix, the EXACT , followed by a trie containing the empty string, the letters "oa" and the letter "o". The S:2/4 means we start on state 2 not on the normal state 1, which is the state that corresponds to the letter 'f'. Extracting the 'f' like this allows it to be used elsewhere in the matching process, which means we can use less expensive machinery the Trie to find possible matches.
11: EVAL (13)
13: EXACT <al> (15)
15: END (0)
anchored "f" at 0 floating "al" at 1..3 (checking floating) minlen 3 with eval
Guessing start of match in sv for REx "(?:f|foa|fo)(?{print $&})al" against "foal"
Found floating substr "al" at offset 2...
Found anchored substr "f" at offset 0...
The 'f' that was extracted earlier is being used to find the first possible place in the string the pattern could match.
Guessed: match at offset 0
Matching REx "(?:f|foa|fo)(?{print $&})al" against "foal"
0 <> <foal> | 1:EXACT <f>(3)
1 <f> <oal> | 3:TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o](11)
1 <f> <oal> | State: 2 Accepted: Y Charid: 2 CP: 6f After State: 3
2 <fo> <al> | State: 3 Accepted: Y Charid: 3 CP: 61 After State: 4
3 <foa> <l> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
Here we can see the walk through tree starting at state 2. Each "Accepted" line means we have found a possible match. The last line shows us reading a character not in our alphabet, meaning a transition to state 0 which means we can stop. Unfortunately this does not show the word numbers associated with each match, but you can infer them from the state transition table: 1,3,2, but we need to "try" them 1,2,3.
got 3 possible matches
TRIE matched word #1, continuing
1 <f> <oal> | 11: EVAL(13)
f
1 <f> <oal> | 13: EXACT <al>(15)
failed...
TRIE matched word #2, continuing
3 <foa> <l> | 11: EVAL(13)
foa
3 <foa> <l> | 13: EXACT <al>(15)
failed...
TRIE matched word #3, continuing
only one match left, short-circuiting: #3 <o>
2 <fo> <al> | 11:EVAL(13)
fo
2 <fo> <al> | 13:EXACT <al>(15)
4 <foal> <> | 15:END(0)
Match successful!
Note an explanation of different aspects of the Trie compilation process can be found at:
https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l2407
https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l4726