Is it faster to use alternation than subsequent replacements in regular expressions
Asked Answered
C

6

7

I have quite a straightforward question. Where I work I see a lot of regular expressions come by. They are used in Perl to get replace and/or get rid of some strings in text, e.g.:

$string=~s/^.+\///;
$string=~s/\.shtml//;
$string=~s/^ph//;

I understand that you cannot concatenate the first and last replacement, because you may only want to replace ph at the beginning of the string after you did the first replacement. However, I would put the first and second regex together with alternation: $string=~s/(^.+\/|\.shtml)//; Because we're processing thousands of files (+500,000) I was wondering which method is the most efficient.

Camenae answered 5/4, 2016 at 8:6 Comment(7)
No, do not omit language tag, it is crucial.Salmanazar
@WiktorStribiżew Why is that, if this question concerns all types of regex?Camenae
Regex implementations across languages are VERY different. Using lazy dot matching in .NET and PCRE will lead to different amount of performance drop. Lookbehinds with limiting quantifiers in Java and .NET differ in the way the string and pattern are analyzed. So, you need Perl experts' advice.Salmanazar
Thanks for the explanation.Camenae
It does in fact depend on data a lot. I think that it may be fair to say that in general you cannot make a call. Extreme examples. If the pattern has a lot of fixed strings it may be able to match an alternation in one pass, or at least in not many. If you have fixed patterns at start and end separate ones are better. If the pattern is complicated when you break it up you may be able to rewrite the (sub)patterns so to minimize backtracking (the performance killer). If you can narrow down your data spec that'd be different. But those micro-optimizations are costly, of course.Outbreak
Your combined regex is not equivalent: the original code turns foo/bar.shtml into bar; your version turns it into bar.shtmlDiesel
In general I'm going to say that it's faster to prioritize the expressions in an alternation. The behind the scenes of all replacements are that a new string is created each time a regex is run. Is it faster to create just a single new string or create three new strings? Probably faster to create a single new one. There is also the consideration that by doing it once, the same territory is not rechecked.Mimamsa
D
10

Your expressions are not equivalent

This:

$string=~s/^.+\///;
$string=~s/\.shtml//;

replaces the text .shtml and everything up to and including the last slash.

This:

$string=~s/(^.+\/|\.shtml)//;

replaces either the text .shtml or everything up to and including the last slash.

This is one problem with combining regexes: a single complex regex is harder to write, harder to understand, and harder to debug than several simple ones.

It probably doesn't matter which is faster

Even if your expressions were equivalent, using one or the other probably wouldn't have a significant impact on your program's speed. In-memory operations like s/// are significantly faster than file I/O, and you've indicated that you're doing a lot of file I/O.

You should profile your application with something like Devel::NYTProf to see if these particular substitutions are actually a bottleneck (I doubt they are). Don't waste your time optimizing things that are already fast.

Alternations hinder the optimizer

Keep in mind that you're comparing apples and oranges, but if you're still curious about performance, you can see how perl evaluates a particular regex using the re pragma:

$ perl -Mre=debug -e'$_ = "foobar"; s/^.+\///; s/\.shtml//;'
...
Guessing start of match in sv for REx "^.+/" against "foobar"
Did not find floating substr "/"...
Match rejected by optimizer
Guessing start of match in sv for REx "\.shtml" against "foobar"
Did not find anchored substr ".shtml"...
Match rejected by optimizer
Freeing REx: "^.+/"
Freeing REx: "\.shtml"

The regex engine has an optimizer. The optimizer searches for substrings that must appear in the target string; if these substrings can't be found, the match fails immediately, without checking the other parts of the regex.

With /^.+\//, the optimizer knows that $string must contain at least one slash in order to match; when it finds no slashes, it rejects the match immediately without invoking the full regex engine. A similar optimization occurs with /\.shtml/.

Here's what perl does with the combined regex:

$ perl -Mre=debug -e'$_ = "foobar"; s/(?:^.+\/|\.shtml)//;'
...
Matching REx "(?:^.+/|\.shtml)" against "foobar"
   0 <> <foobar>             |  1:BRANCH(7)
   0 <> <foobar>             |  2:  BOL(3)
   0 <> <foobar>             |  3:  PLUS(5)
                                    REG_ANY can match 6 times out of 2147483647...
                                    failed...
   0 <> <foobar>             |  7:BRANCH(11)
   0 <> <foobar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   1 <f> <oobar>             |  1:BRANCH(7)
   1 <f> <oobar>             |  2:  BOL(3)
                                    failed...
   1 <f> <oobar>             |  7:BRANCH(11)
   1 <f> <oobar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   2 <fo> <obar>             |  1:BRANCH(7)
   2 <fo> <obar>             |  2:  BOL(3)
                                    failed...
   2 <fo> <obar>             |  7:BRANCH(11)
   2 <fo> <obar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   3 <foo> <bar>             |  1:BRANCH(7)
   3 <foo> <bar>             |  2:  BOL(3)
                                    failed...
   3 <foo> <bar>             |  7:BRANCH(11)
   3 <foo> <bar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   4 <foob> <ar>             |  1:BRANCH(7)
   4 <foob> <ar>             |  2:  BOL(3)
                                    failed...
   4 <foob> <ar>             |  7:BRANCH(11)
   4 <foob> <ar>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
   5 <fooba> <r>             |  1:BRANCH(7)
   5 <fooba> <r>             |  2:  BOL(3)
                                    failed...
   5 <fooba> <r>             |  7:BRANCH(11)
   5 <fooba> <r>             |  8:  EXACT <.shtml>(12)
                                    failed...
                                  BRANCH failed...
Match failed
Freeing REx: "(?:^.+/|\.shtml)"

Notice how much longer the output is. Because of the alternation, the optimizer doesn't kick in and the full regex engine is executed. In the worst case (no matches), each part of the alternation is tested against each character in the string. This is not very efficient.

So, alternations are slower, right? No, because...

It depends on your data

Again, we're comparing apples and oranges, but with:

$string = 'a/really_long_string';

the combined regex may actually be faster because with s/\.shtml//, the optimizer has to scan most of the string before rejecting the match, while the combined regex matches quickly.

You can benchmark this for fun, but it's essentially meaningless since you're comparing different things.

Diesel answered 13/4, 2016 at 23:5 Comment(0)
Y
3

How regex alternation is implemented in Perl is fairly well explained in perldoc perlre

Matching this or that

We can match different character strings with the alternation metacharacter '|' . To match dog or cat , we form the regex dog|cat . As before, Perl will try to match the regex at the earliest possible point in the string. At each character position, Perl will first try to match the first alternative, dog . If dog doesn't match, Perl will then try the next alternative, cat . If cat doesn't match either, then the match fails and Perl moves to the next position in the string. Some examples:

"cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
"cats and dogs" =~ /dog|cat|bird/;  # matches "cat" 

Even though dog is the first alternative in the second regex, cat is able to match earlier in the string.

"cats"          =~ /c|ca|cat|cats/; # matches "c"
"cats"          =~ /cats|cat|ca|c/; # matches "cats" 

Here, all the alternatives match at the first string position, so the first alternative is the one that matches. If some of the alternatives are truncations of the others, put the longest ones first to give them a chance to match.

"cab" =~ /a|b|c/ # matches "c"
                 # /a|b|c/ == /[abc]/ 

The last example points out that character classes are like alternations of characters. At a given character position, the first alternative that allows the regexp match to succeed will be the one that matches.

So this should explain the price you pay when using alternations in regex.

When putting simple regex together, you don't pay such a price. It's well explained in another related question in SO. When directly searching for a constant string, or a set of characters as in the question, optimizations can be done and no backtracking is needed which means potentially faster code.

When defining the regex alternations, just choosing a good order (putting the most common findings first) can influence the performance. It is not the same either to choose between two options, or twenty. As always, premature optimization is the root of all evil and you should instrumentiate you code (Devel::NYTProf) if there are problems or you want improvements. But as a general rule alternations should be kept to a minimum and avoided if possible since:

  • They easily make the regex too big an complex. We like simple, easy to understand / debug / maintain regex.
  • Variability and input dependant. They could be an unexpected source of problems since they backtrack and can lead to unexpected lack of performance depending on your input. As I understand, there's no case when they will be faster.
  • Conceptually you are trying to match two different things, so we could argue that two different statements are more correct and clear than just one.

Hope this answer gets closer to what you were expecting.

Yang answered 13/4, 2016 at 20:10 Comment(1)
The optimizer kicks in for both $string=~s/^.+\///; and $string=~s/\.shtml//; and bails out immediately if $string contains no slashes or no .shtml, respectively. This is fast because execution never even makes it into the full regex engine.Diesel
T
1

First, measure the various options on your real data, because no amount of theory will beat an experiment (if one can be done). There are many timing modules on CPAN that will help you.

Second, if you decide to optimize the regexes, do not squish them into one giant monster by hand, try to assemble the “master” regex with code. Otherwise no-one will be able to decipher the code.

Tummy answered 5/4, 2016 at 8:15 Comment(2)
Not trying to be rude, but this seems more like a comment than an answer to the question whether alternation is faster than following replacements. I know I can test it myself, but I don't have the time or resources to set up an experiment. I had hoped there was a sort-of clear-cut answer to alternation vs. multiple replacements.Camenae
I don’t think you can get a better answer, that’s why I didn’t submit this as a comment. Regex performance is tricky because the regex engine is heavily optimized for various special cases. Trivial changes can often lead to big differences in performance. The general case of the question is almost impossible to answer. As for the time or resources required to set up an experiment, see the timing modules on CPAN, it should be trivial to decide your particular case.Tummy
B
1

Combination is not your best option

If you have three regexes that work perfectly well, there's no benefit to combine them. Not only does rewriting them open the door for bugs, it makes it harder for both programmer and engine to read the regex.

This page suggests that instead alteration like:

while (<FILE>) {
    next if (/^(?:select|update|drop|insert|alter)\b/);     
    ...  
}

You should use:

while (<FILE>) {
    next if (/^select/);
    next if (/^update/);
    ...
}

You can further optimize your regexes

You can use regex objects, which will ensure that your regex is not being recompiled in a loop:

my $query = qr/foo$bar/; #compiles here
@matches = (  );
...
while (<FILE>) {
    push @matches, $_ if /$query/i;
}

You may also be able to optimize the .+. It will eat up the entire file, and then has to backtrack character by character until it finds a / so it can match. If there is only one / per file, try a negated character class like: [^/] (escaped: [^\/]). Where do you expect to find the / in your file? Knowing that will allow your regex to become faster.

The speed of replacement depends on other factors

If you have performance issues (currently, with the 3 regexes), it could be a different part of your program. While the processing speed of computers has grown exponentially, read and write speed has experienced little growth.

There may be faster engines for search and replacing text in a file

Perl uses NFA, which is slower yet more powerful than the DFA engine sed has. NFA backtracks (especially with alterations) and has a worst-case exponential run time. DFA has linear execution time. Your patterns do not need an NFA engine, so you can use your regexes in a DFA engine, like sed, very easily.

According to here sed can do search and replace at a speed of 82.1 million characters processed per second (note that this test was writing to /dev/null, so hard disk write speed wasn't really a factor).

Bezel answered 14/4, 2016 at 17:39 Comment(3)
From perlop: The bottom line is that using /o is almost never a good idea. From perlre: o - pretend to optimize your code, but actually introduce bugsArdrey
In Perl 5.6+, /$query/ is only re-compiled if $query changes, so the 'o' modifier is unnecessary (and introduces side effects). This happens even without using qr.Diesel
@Diesel I removed the o modifier part. I mention precompiling, instead. In many flavors, precompiling WILL have a definite benefit. In this case, precompiling will either do nothing, or provide a benefit, so it's probably worth doing because it's so easy to add.Bezel
P
0

A bit off topic maybe, but if the actual replacements are rare, relative to number of comapares (10%-20%?), you may gain some speed with using an index match first

$string=~s/\.shtml//
    if index($string, ".shtml");
Prewar answered 16/4, 2016 at 10:50 Comment(0)
S
-2

Second method is best in which you put first and second regex together with alternation. Because in that method the perl do traverse once, and check both the expressions.

If you use first method in which perl have to traverse separate for both expressions.

Hence Number of loops decreased in Second method.

Significs answered 5/4, 2016 at 9:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.