How does Perl's regex implementation makes use of tries?
Asked Answered
O

2

5

I'm exploring ways to optimize a stack-based backtracking regex implementation (see also this thread). In the comments a hint about Perl's regex matcher was given. From the source code I already figured that it makes use of tries when matching a regex with several alternatives.

For example, a regex like /(?:foo|bar|baz|foobar)$/ would normally yield a program like

BRANCH
  EXACT <foo>
BRANCH
  EXACT <bar>
BRANCH
  EXACT <baz>
BRANCH
  EXACT <foobar>
EOL

while the optimized version using a trie looks somewhat like

TRIE-EXACT
  <foo>
  <bar>
  <baz>
  <foobar>
EOL

The four branches with an EXACT command each were optimized into a single TRIE command.

However, I don't understand how this trie is used in the execution phase with respect to backtracking.

Consider the non-optimized code and the input string foobar. When trying to match the regex, the first branch foo would be successfully matched, the $ would fail, the next branches bar and baz would fail, the last branch foobar would match, $ would match, and the matching completes successfully.

Now how does the matching work with the trie? From my understanding a trie has only one sensible implied order in which overlapping entries are tried, namely the longest match first. This would give a correct match in the aforementioned example.

But it would give the wrong result for /(?:foo|foobar)barr$/ and the input foobarr. The trie would match foobar, but fail on the following r because it expects a b next. So there must not only be a way to backtrack in the trie, but also the original order of branches must be preserved somehow, otherwise the optimized program would be semantically different from the non-optimized version.

How does the Perl implementation handle this tree-based matching?

Osier answered 10/8, 2019 at 5:51 Comment(9)
Good question. I've wondered about this too.Tanah
You have the source code and (presumably) the documentation so are you not in a good position to supply the answer yourself? If the documentation is weak in that area then you could help the Perl community by enhancing the documentation.Khalil
@AdrianHHH, Re "You have the source code and (presumably) the documentation so are you not in a good position to supply the answer yourself?", I spent a fair amount of time on it, and I didn't come up with the answer. The code in question has a lot of documentation, but it doesn't cover this aspect.Tanah
Despite the two close votes, this question is not "too broad". It's actually very specific, and could be answered (at a high level) without difficulty by someone knowledgeable. Please don't close. (By the way, one of the major contributors of Perl's regex engine and maybe even the code in question is a regular here.)Tanah
@Tanah so is apparently the creator of cURL. That doesn't automatically make the question any good.Neanderthal
@Zoe Yeah, but this question is about specific code, not motivations.Siqueiros
@Zoe, Please reread what I said. I never claimed the presence of the author is a reason to keep the question open (as you seem to imply), and noone mentioned software licenses (the subject of the linked Q&A). I merely said the question wasn't too broad --in fact, it's quite specific-- so voting to close it as too broad is wrong.Tanah
@Tanah my point was that the presence of major contributors to the engine alone doesn't make the question any good, making that argument slightly pointless. Also, I never voted to close. It didn't seem too broad, and I figured it could be answered by someone. I'm not arguing against reopening or for closure, I'm arguing against your argument.Neanderthal
@Siqueiros Similar to ^, the comment was never intended as proof this is off-topic. Of course this is about specific code (well, seems more like a question about a specific system to me, but close enough I suppose) and not motivations, it was an argument against the use of a specific argument (probably should've made that clearer)Neanderthal
D
6

I wrote the original implementation of this code. It has changed somewhat since then, which has slightly changed the time/space tradeoffs for how it works.

First, I need to call out one misunderstanding in your post, the rules for Perls regex engine are not to match the longest string, but rather to match the leftmost-longest string, in particular

"foal"=~/(f|fo|foa|foal)/ 

should match "f" not "foal" as the "f" option is first. This is called the "word order" in the comments in the code. You can see this with Perl by injecting (?{ print $& }) into your patterns:

perl -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/'
f
foa
fo 

Another point I thought I should explain is that the code and comments use the term "state" and "accepting state" a lot. This comes from the observation that a Trie is equivalent to an acyclic DFA, and can be naively represented as a table of one row per state, with one column per legal character in the input alphabet and with an additional column being used to track whether the state is "accepting", meaning "ends one of the alternatives", and if so which branch of the alternation it terminates. The values in the other columns represent the new state to transition to after reading the columns character. An accepting state may have transitions to more states, as in the case where one branch matches a prefix of another.

Matching then comes down to walking the tree/state table, when we encounter an accepting state we push a 2-tuple of the number of characters we have read so far in the trie and the word number associated with the state into a list. Eventually we either run out of string to read, or encounter a transition to the 0 state which means we can stop as no further matches are possible.

Once you have a list of the tuples of (length,word) you can then go through them in order of word ascending, since such lists are generally short IMO it is reasonable to simply use linear probing to find the minimum word each try. Perl uses a strategy where it precompiles all such possible lists into one structure in advance, so it need only keep track of the branch number of the most recent accepting state, and then can (somewhat expensively) compute any preceeding accepting states from that. For instance the example below /(f|foa|fo)al/ produces a structure like this:

word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)

Which shows that if we reach the accepting state for "foa" which is branch 2 we have matched a 3 character string and there is a preceding accepting state of branch 3, which in turn points at 1. From this we can reconstruct the behavior we would expect to see, which is to try "al" after "f", "foa" and then "fo". The process is somewhat expensive (N**2) compared to other solutions, but has the advantage of being precompiled and reusable and the storage requirements fit nicely with how the regex backtracking engine works. IMO it doesnt matter, the main issue is that somehow you track the accepting states where you were in the string when you encountered them, and then try the "tail" in the right order.

You can see much of this at work with extended debugging:

perl -Mre=Debug,TRIE,TRIEE,TRIEM -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/' 
Compiling REx "(?:f|foa|fo)(?{print $&})al"
  Looking for TRIE'able sequences. Tail node is: EVAL
  - BRANCH (1) -> EXACT <f> => EVAL (First==-1,Last==-1,Cur==1,tt==END,nt==EXACT,nnt==END)
  - BRANCH (4) -> EXACT <foa>   => EVAL (First==1,Last==-1,Cur==4,tt==EXACT,nt==EXACT,nnt==END)
  - BRANCH (7) -> EXACT <fo>    => EVAL (First==1,Last==4,Cur==7,tt==EXACT,nt==EXACT,nnt==END)
  - TAIL (10) <SCAN FINISHED>
    make_trie start==1, first==1, last==10, tail==11 depth=1
    TRIE(NATIVE): W:3 C:6 Uq:3 Min:1 Max:3
    Compiling trie using table compiler
      Char :    f   o   a
      State+-------------
         1 :    2   .   . (   1)
         2 :    .   3   . (   1) W   1
         3 :    .   .   4 (   1) W   3
         4 :    .   .   . (   0) W   2

Here you can see the DFA/TRIE equivalency and the word data, and how the table is constructed. Because there are only 3 legal characters we build a table that has 4 columns, one per each legal character, and one to track if the state is accepting and if so which branch of the alternation it accepts. The W 2 on row 4 shows that it is accepting for the 2nd branch of the alternation.

    Alloc: 22 Orig: 13 elements, Final:3. Savings of %76.92
    Statecount:5 Lasttrans:4
      Char : Match Base  Ofs     f   o   a
      State|-----------------------------------
      #   1|       @   3 + 0[    2   .   .]
      #   2| W   1 @   3 + 1[    .   3   .]
      #   3| W   3 @   3 + 2[    .   .   4]
      #   4| W   2 @   0 

This shows the naive table representation being packed into a compressed form that allows the zero transition in one row to be used to store non-zero transitions in other rows. Each transition is stored in an array of 2-tuple of [owner-state,transition], with two side arrays, one which stores the start position of a given state, and another that stores the accepting mappings for a given state, state transitions involve checking if table[start_offset[state]+column_ofs].owner_state is the same as state, and if not then table[start_offset[state]+column_ofs].new_state can be assumed to be 0. See the Red Dragon for more details.


    word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)

this part of the output shows how we have precomputed the length, and the predecessor branch number for every accepting state. This way we don't have to construct a list of accepting states as we match, but instead can precompute all the possible such lists at compile time into one static structure.

Final program:
   1: EXACT <f> (3)
   3: TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o] (11)
      <> 
      <oa> 
      <o> 

Here you can see an interesting optimization kick in, because all of the alternations start with the letter "f" we "remove" it from the trie, and turn it into a prefix, the EXACT , followed by a trie containing the empty string, the letters "oa" and the letter "o". The S:2/4 means we start on state 2 not on the normal state 1, which is the state that corresponds to the letter 'f'. Extracting the 'f' like this allows it to be used elsewhere in the matching process, which means we can use less expensive machinery the Trie to find possible matches.

  11: EVAL (13)
  13: EXACT <al> (15)
  15: END (0)
anchored "f" at 0 floating "al" at 1..3 (checking floating) minlen 3 with eval 
Guessing start of match in sv for REx "(?:f|foa|fo)(?{print $&})al" against "foal"
Found floating substr "al" at offset 2...
Found anchored substr "f" at offset 0...

The 'f' that was extracted earlier is being used to find the first possible place in the string the pattern could match.

Guessed: match at offset 0
Matching REx "(?:f|foa|fo)(?{print $&})al" against "foal"
   0 <> <foal>               |  1:EXACT <f>(3)
   1 <f> <oal>               |  3:TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o](11)
   1 <f> <oal>               |    State:    2 Accepted: Y Charid:  2 CP:  6f After State:    3
   2 <fo> <al>               |    State:    3 Accepted: Y Charid:  3 CP:  61 After State:    4
   3 <foa> <l>               |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0

Here we can see the walk through tree starting at state 2. Each "Accepted" line means we have found a possible match. The last line shows us reading a character not in our alphabet, meaning a transition to state 0 which means we can stop. Unfortunately this does not show the word numbers associated with each match, but you can infer them from the state transition table: 1,3,2, but we need to "try" them 1,2,3.

                                  got 3 possible matches
                                  TRIE matched word #1, continuing
   1 <f> <oal>               | 11:  EVAL(13)
f
   1 <f> <oal>               | 13:  EXACT <al>(15)
                                    failed...
                                  TRIE matched word #2, continuing
   3 <foa> <l>               | 11:  EVAL(13)
foa
   3 <foa> <l>               | 13:  EXACT <al>(15)
                                    failed...
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <o>
   2 <fo> <al>               | 11:EVAL(13)
fo
   2 <fo> <al>               | 13:EXACT <al>(15)
   4 <foal> <>               | 15:END(0)
Match successful!

Note an explanation of different aspects of the Trie compilation process can be found at:

https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l2407

https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l4726

Duax answered 13/8, 2019 at 19:27 Comment(3)
Thanks for the detailed answer!Gaddis
What is the meaning of this output line: BRANCH (7) -> EXACT <fo> => EVAL (First==1,Last==4,Cur==7,tt==EXACT,nt==EXACT,nnt==END) ? For example: What does BRANCH (7) mean? What is First, Last and Cur?Gaddis
The trie optimisation is implemented in a peephole optimizer that runs over the regex optree after the initial structure has been constructed. An alternation is originally constructed as a sequence of BRANCH ... BRANCH ... , when the ... starts with an EXACT (ish) node and there are more than two such sequences in a row the sequence can be converted to a TRIE. What you are seeing there is debug output of the peephole optimiser looking to see if the sequence can be converted. Try: perl -Mre=Debug,ALL -le'qr/(?:a|(??{"x"})|b)/' for debug output of the whole regex compilation process.Duax
R
1

You can get some information by using the re pragma to show how perl compiles and uses the regular expression. For example:

$ perl -Mre=debugcolor -e '"foobarr" =~ /(?:foobar|foo)barr$/' 

gives output:

enter image description here

So it is clear that it compiles the regex into three operations:

  • An EXACT operation which matches foo (this is the common prefix of the alternation (?:foobar|foo), followed by
  • a TRIE-EXACT operation matching bar or the empty string, and then
  • an EXACT operation matching barr.

Then the compiled regex is used in the match:

  • First the EXACT match of foo matches, then
  • the TRIE-EXACT tries the first alternative bar, which matches, then
  • the EXACT operation fails on matching r with barr,
  • the engine backtracks and tries the second word in the TRIE-EXACT operation, which matches the empty string,
  • and finally the last EXACT operation matches barr.

Currently this does not answer your question on how exactly the trie-based matching is implemented, but I plan to investigate further! See also perldoc perlreguts for more information.

Edit: More information on how the trie matching is implemented can be found in regexec.c line #5931:

[...] the basic plan of execution of the trie is: At the beginning, run through all the states, and find the longest-matching word. Also remember the position of the shortest matching word. For example, this pattern:

  1  2 3 4    5
  ab|a|x|abcd|abc

when matched against the string "abcde", will generate accept states for all words except 3, with the longest matching word being 4, and the shortest being 2 (with the position being after char 1 of the string).

Then for each matching word, in word order (i.e. 1,2,4,5), we run the remainder of the pattern; on each try setting the current position to the character following the word, returning to try the next word on failure.

This should explain how the TRIE based matching manages to do the alternations in word-order.

Raylenerayless answered 11/8, 2019 at 9:11 Comment(8)
That doesn't answer the question. The question asks how how can the difference between perl -E'say for "foobar" =~ /(foo|foobar|bar|baz)/' and perl -E'say for "foobar" =~ /(foobar|foo|bar|baz)/' be achieved using tries. -Mre=debug shows is that the trie is used to find all possible leading matches (foo and foodbar in both examples), and that the engine somehow knows that the first of the original words matched. But it doesn't show how does it checks that last, crucial bit.Tanah
@Tanah Ok, when I try this, the first gives foo and the second gives foobar. Is this to be expected?Gaddis
Yes. The question is how is this achieved. Traditionally, all words were tested in order, which could be very slow. But 5.10 introduced a trie-based system to speed things up. But how can a trie be used to achieve this?Tanah
@Tanah I have updated my answer with new information.Gaddis
Re "* will generate accept states*", This is a long way of saying it will find all possible matches from the trie, as I mentioned it does. /// Re "Then for each matching word, in word order (i.e. 1,2,4,5)", We know that's what it does. The code I provided showed it does. The question is how is this done?Tanah
Put differently, the trie returns a list of matching words sorted by length (1,2 for first example, and 2,1 for second example). How is that list turned into a list of matching words sorted by word order (1,2 in both examples)Tanah
From the screenshot and the documentation excerpt I think I got the basic idea of how the matching is done. This is already very useful! The approach seems to be: find the longest match in the trie first to limit the number of words to consider when backtracking (steps 3-6 in the screenshot). Only then begin actually matching all alternatives that lie on that path. The missing detail is how it stores the order of words in the trie and quickly finds the next to consider on backtracking.Osier
So it seems it's just traversing the whole path on each backtracking step to find the next word in the list. In one comment it says "With lots of accepts states, and having to try all of them, it becomes quadratic on number of accept states to find all the next words." It's probably not worth optimizing that away directly in the compilation phase, as a lot of nested prefixes in one group should be relatively rare.Osier

© 2022 - 2024 — McMap. All rights reserved.