Your expressions are not equivalent
This:
$string=~s/^.+\///;
$string=~s/\.shtml//;
removes the text .shtml and everything up to and including the last slash.
This:
$string=~s/(^.+\/|\.shtml)//;
removes either everything up to and including the last slash or the text .shtml, but not both: without the /g modifier, only the first match is replaced.
This is one problem with combining regexes: a single complex regex is harder to write, harder to understand, and harder to debug than several simple ones.
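To make the difference concrete, here is a minimal sketch that runs both versions on the same input. The sample path and the variable names are assumptions chosen for illustration; the two substitutions are the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample path, chosen only for illustration.
my $path = 'foo/bar.shtml';

# Two simple substitutions: strip the directory part, then the extension.
(my $two_step = $path) =~ s/^.+\///;
$two_step =~ s/\.shtml//;

# One combined substitution: without /g, only the first match is replaced,
# so the .shtml suffix survives whenever the string contains a slash.
(my $combined = $path) =~ s/(^.+\/|\.shtml)//;

print "two-step: $two_step\n";   # prints "two-step: bar"
print "combined: $combined\n";   # prints "combined: bar.shtml"
```

Adding /g to the combined regex would remove both parts for this input, but that is yet another expression with its own behavior to reason about.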
It probably doesn't matter which is faster
Even if your expressions were equivalent, using one or the other probably wouldn't have a significant impact on your program's speed. In-memory operations like s/// are significantly faster than file I/O, and you've indicated that you're doing a lot of file I/O.
You should profile your application with something like Devel::NYTProf to see if these particular substitutions are actually a bottleneck (I doubt they are). Don't waste your time optimizing things that are already fast.
Alternations hinder the optimizer
Keep in mind that you're comparing apples and oranges, but if you're still curious about performance, you can see how perl evaluates a particular regex using the re pragma:
$ perl -Mre=debug -e'$_ = "foobar"; s/^.+\///; s/\.shtml//;'
...
Guessing start of match in sv for REx "^.+/" against "foobar"
Did not find floating substr "/"...
Match rejected by optimizer
Guessing start of match in sv for REx "\.shtml" against "foobar"
Did not find anchored substr ".shtml"...
Match rejected by optimizer
Freeing REx: "^.+/"
Freeing REx: "\.shtml"
The regex engine has an optimizer. The optimizer searches for substrings that must appear in the target string; if these substrings can't be found, the match fails immediately, without checking the other parts of the regex.
With /^.+\//, the optimizer knows that $string must contain at least one slash in order to match; when it finds no slashes, it rejects the match immediately without invoking the full regex engine. A similar optimization occurs with /\.shtml/.
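If you would rather inspect those required substrings from code than read the debug output, the re pragma also exports regmust, which reports the anchored and floating fixed strings the optimizer extracted from a compiled pattern. The following is a sketch only: the helper name must_have is made up for this example, and the exact values reported may vary between perl versions, so no expected output is shown.

```perl
use strict;
use warnings;
use re qw(regmust);

# Hypothetical helper: describe the fixed substrings the optimizer
# extracted from a compiled pattern. An empty or undefined value is
# reported as "(none)".
sub must_have {
    my ($qr) = @_;
    my ($anchored, $floating) = regmust($qr);
    my $show = sub { defined $_[0] && length $_[0] ? "'$_[0]'" : '(none)' };
    return sprintf 'anchored=%s floating=%s',
        $show->($anchored), $show->($floating);
}

# The two simple patterns and the combined one from this answer.
for my $qr (qr/^.+\//, qr/\.shtml/, qr/(?:^.+\/|\.shtml)/) {
    printf "%-24s %s\n", $qr, must_have($qr);
}
```

A pattern for which both values come back as "(none)" gives the optimizer no required substring to check before the full engine runs.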
Here's what perl does with the combined regex:
$ perl -Mre=debug -e'$_ = "foobar"; s/(?:^.+\/|\.shtml)//;'
...
Matching REx "(?:^.+/|\.shtml)" against "foobar"
0 <> <foobar> | 1:BRANCH(7)
0 <> <foobar> | 2: BOL(3)
0 <> <foobar> | 3: PLUS(5)
REG_ANY can match 6 times out of 2147483647...
failed...
0 <> <foobar> | 7:BRANCH(11)
0 <> <foobar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
1 <f> <oobar> | 1:BRANCH(7)
1 <f> <oobar> | 2: BOL(3)
failed...
1 <f> <oobar> | 7:BRANCH(11)
1 <f> <oobar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
2 <fo> <obar> | 1:BRANCH(7)
2 <fo> <obar> | 2: BOL(3)
failed...
2 <fo> <obar> | 7:BRANCH(11)
2 <fo> <obar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
3 <foo> <bar> | 1:BRANCH(7)
3 <foo> <bar> | 2: BOL(3)
failed...
3 <foo> <bar> | 7:BRANCH(11)
3 <foo> <bar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
4 <foob> <ar> | 1:BRANCH(7)
4 <foob> <ar> | 2: BOL(3)
failed...
4 <foob> <ar> | 7:BRANCH(11)
4 <foob> <ar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
5 <fooba> <r> | 1:BRANCH(7)
5 <fooba> <r> | 2: BOL(3)
failed...
5 <fooba> <r> | 7:BRANCH(11)
5 <fooba> <r> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
Match failed
Freeing REx: "(?:^.+/|\.shtml)"
Notice how much longer the output is. Because of the alternation, the optimizer doesn't kick in and the full regex engine is executed. In the worst case (no matches), each branch of the alternation is tried at every position in the string. This is not very efficient.
So, alternations are slower, right? No, because...
It depends on your data
Again, we're comparing apples and oranges, but with:
$string = 'a/really_long_string';
the combined regex may actually be faster because with s/\.shtml//, the optimizer has to scan most of the string before rejecting the match, while the combined regex matches quickly.
You can benchmark this for fun, but it's essentially meaningless since you're comparing different things.
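If you do want to try it, a minimal sketch with the core Benchmark module could look like this. The sub names two_step and combined are assumptions made for the example, and the timings will depend entirely on your perl build and on the input you choose.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input from the text above; swap in your own data.
my $input = 'a/really_long_string';

# Each sub works on a copy so every iteration sees the same string.
sub two_step {
    my $s = shift;
    $s =~ s/^.+\///;
    $s =~ s/\.shtml//;
    return $s;
}

sub combined {
    my $s = shift;
    $s =~ s/(?:^.+\/|\.shtml)//;
    return $s;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    two_step => sub { two_step($input) },
    combined => sub { combined($input) },
});
```

The two subs happen to return the same result for this input, but not for inputs such as foo/bar.shtml, so the comparison says little about which one to use.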
foo/bar.shtml into bar; your version turns it into bar.shtml – Diesel