Is substring matching faster with a regular expression?
After reading up on REs, NFAs, and DFAs, it seems that finding a substring within a string might actually be asymptotically faster using an RE than with a brute-force O(mn) search. My reasoning is that a DFA maintains state and so avoids processing any character in the "haystack" more than once. Searches in long strings may therefore be much faster if done with regular expressions.
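To make the worst case concrete, here is a minimal sketch (my own illustrative Python, not from any particular library) of the brute-force search and a degenerate input that forces it to re-examine characters:

```python
def naive_find(haystack: str, needle: str) -> int:
    """Brute-force search: O(m*n) in the worst case."""
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        # Each alignment may re-examine characters that the previous
        # alignment already looked at.
        if haystack[i:i + m] == needle:
            return i
    return -1

# Degenerate case: every alignment matches m-1 characters, then fails,
# so roughly m*n character comparisons are performed in total.
haystack = "a" * 100_000
needle = "a" * 500 + "b"
print(naive_find(haystack, needle))  # -1
```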

Of course, this holds only for RE matchers that convert the NFA to a DFA.

Has anyone seen better string-matching performance in real life when using an RE rather than a brute-force matcher?

Decapitate answered 21/7, 2010 at 20:5 Comment(0)

First of all, I recommend reading this article on the internals of regular expression engines in several languages: Regular Expression Matching Can Be Simple And Fast.

Because regexes in many languages are used not just for matching but also for group capturing and back-referencing, almost all implementations use so-called "backtracking" when executing the NFA built from the given regex, and this implementation has exponential time complexity in the worst case.
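To see that worst case concretely, here is a small demonstration (my own sketch, using Python's backtracking re module) of the classic pathological pattern (a+)+b, whose running time roughly doubles with each extra character:

```python
import re
import time

# Nested quantifiers force the engine to try exponentially many ways
# to split the run of 'a's between the inner and outer groups.
pattern = re.compile(r"(a+)+b")

for n in (18, 20, 22, 24):
    text = "a" * n  # no 'b', so every split attempt ultimately fails
    start = time.perf_counter()
    pattern.search(text)
    print(n, f"{time.perf_counter() - start:.3f}s")
```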

An RE implementation via a DFA (with group capturing) is possible, but it carries overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).

For simple substring searching you can use the Knuth-Morris-Pratt algorithm, which builds a DFA to search for the substring and has optimal O(len(s)) complexity. But it has overhead too, and if you test the naive O(mn) approach against this optimal algorithm on real-world words and phrases (which are not very repetitive), you may find that the naive approach is faster on average.
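For reference, a compact KMP sketch in Python (illustrative code, not from a library): it precomputes the failure table, then consumes each haystack character exactly once:

```python
def kmp_find(haystack: str, needle: str) -> int:
    """Return index of the first occurrence of needle, or -1. O(n + m)."""
    if not needle:
        return 0
    # failure[i] = length of the longest proper prefix of needle[:i+1]
    # that is also a suffix of it.
    failure = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = failure[k - 1]
        if needle[i] == needle[k]:
            k += 1
        failure[i] = k
    # Scan: each haystack character is examined without ever backing up.
    k = 0
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = failure[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            return i - len(needle) + 1
    return -1

print(kmp_find("ababcababcabd", "abcabd"))  # 7
```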

For exact substring searching you could also try the Boyer–Moore algorithm, which has O(mn) worst-case complexity but works better than KMP on average on real-world data.
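For a feel of the skip-based family, here is a sketch of the simpler Boyer-Moore-Horspool variant (as the comment below notes, not the same algorithm as full Boyer-Moore):

```python
def horspool_find(haystack: str, needle: str) -> int:
    """Boyer-Moore-Horspool: sublinear on average, O(n*m) worst case."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Shift table: how far to slide the window when its last character is c.
    shift = {c: m - i - 1 for i, c in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # A character not in the needle allows the maximal shift of m.
        i += shift.get(haystack[i + m - 1], m)
    return -1

print(horspool_find("here is a simple example", "example"))  # 17
```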

Propaedeutic answered 27/7, 2010 at 8:8 Comment(1)
Boyer-Moore is O(n); it never requires more than 3n comparisons. The simpler Boyer-Moore-Horspool can require up to mn, but that's not the "same" algorithm. – Workingman

Most regular expression dialects used in practice are PCRE-style (Perl-Compatible Regular Expressions), which are more expressive than regular languages and thus cannot be expressed with a regular grammar. PCRE has features like positive/negative lookahead/lookbehind assertions and even recursion, so matching may require processing some characters more than once. It all comes down to the particular RE implementation: whether or not it is optimized for expressions that stay within the bounds of a regular grammar.
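For example, lookarounds and back-references already step outside regular languages; a minimal sketch using Python's re module (which supports both, though not PCRE recursion):

```python
import re

# Positive lookahead: match "foo" only when followed by "bar".
# The "bar" characters are examined but not consumed, so the engine
# may visit them again during a later match attempt.
print(re.findall(r"foo(?=bar)", "foobar foobaz foobar"))  # ['foo', 'foo']

# Back-reference: matches a repeated word; the language of strings
# containing "w w" for arbitrary w is not regular.
print(re.search(r"\b(\w+) \1\b", "it was was fine").group(1))  # 'was'
```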

Personally, I haven't done any performance comparisons between the two. In my experience, though, I have never had performance issues with brute-force find-and-replace, while I have had to deal with RE performance bottlenecks on more than one occasion.

Apostil answered 21/7, 2010 at 20:21 Comment(1)
Yes, agreed. I was more concerned with matching a substring with an RE such as /someString/. And yes, it would have to be in C, since I guess that is the only language whose RE engine converts to a DFA. – Decapitate

If you look at the documentation for most languages, it will mention that if you don't need the power of regexes, you should use the non-regex version for performance reasons. For example, http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."
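The same advice carries over elsewhere; as a rough Python analogue of that PHP comparison (timings are illustrative and machine-dependent):

```python
import re
import timeit

line = "a,b,c,d,e,f,g,h" * 1000

# Plain str.split is typically several times faster than re.split
# when the delimiter is a fixed string rather than a pattern.
print(timeit.timeit(lambda: line.split(","), number=1000))
print(timeit.timeit(lambda: re.split(",", line), number=1000))
```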

This is a trade-off that exists everywhere: the more flexible and feature-rich a solution is, the poorer its performance tends to be.

Eous answered 21/7, 2010 at 20:18 Comment(0)
