Why is std::regex notoriously much slower than other regular expression libraries?
Asked Answered
S

1

25

This Github repository added std::regex to the list of regular expression engines and got decimated by the others.

Why is that std::regex - as implemented in libstdc++ - so much slower than others? Is that because of the C++ standard requirements or it is just that that particular implementation is not very well optimized?

Also in the shootout std::regex was unable to compile several regular expressions that all the others accepted, even after adding the flag std::regex::extended. They were (?i)Twain, \b\w+nn\b, (?i)Tom|Sawyer|Huckleberry|Finn, \s[a-zA-Z]{0,12}ing\s, ([A-Za-z]awyer|[A-Za-z]inn)\s and \p{Sm}.

UPDATE: Added comparison with boost::regex.

UPDATE2: added ctre

enter image description here

Stegosaur answered 4/1, 2022 at 18:22 Comment(14)
Most likely due to generality and support for each target architecture by the standard implementation and/or carelessness. Try to upgrade your compiler or toolchain to maybe see improvements.Suffragist
@Suffragist I ran that on Ubuntu 20.04 g++ 9.3.0 which is about 1 year old. I don't see any major changes in the changelogs since than relative to std::regex. gcc.gnu.org/gcc-11/changes.html and gcc.gnu.org/gcc-10/changes.htmlStegosaur
Let's just say that there are no performance requirements for the std::regex. The std::regex requirements are more along the lines of correct operation than speed.Undersexed
In the shootout several expressions threw exceptions (not able to compile) as (?i)Twain, \b\w+nn\b, (?i)Tom|Sawyer|Huckleberry|Finn, \s[a-zA-Z]{0,12}ing\s, ([A-Za-z]awyer|[A-Za-z]inn)\s and \p{Sm}. Should they?Stegosaur
Anything with (?i) is going to fail, because it's not part of the extended posix syntax. I'm not sure about the {Sm}. You should look at the exception being thrown during the construction of the regex object, which couiild give you more info.Diazo
Another possibility is that it is optimised for other use cases. You can parse regex with DFAs or NFAs, the construction of a DFA takes more time, but there are regular expressions which can be paraes significantly faster with a DFA. You should try such a special case as an additional test case.Aegisthus
Is this benchmark testing against a lot or a little text? i.e. how much matching is regex construction amortized over? And I assume you enabled full optimization, like g++ -O3.Microhenry
@PeterCordes The test file is 16MiB with 300k lines and 2.7M words. This fits entirely on the machine's L3 cache. Yes -O3 in place. The construction of the regex is not included in the benchmark, only the matching part. If you look at the tests, it's a microbenchmark. The regex object is constructed outside the loop.Stegosaur
@Aegisthus Do you have any suggestion in mind?Stegosaur
@ThomasMatthews I'm surprised that anything in the STL can be built without performance in mind. C++ is supposed to be a performance-centered language.Stegosaur
@PeterCordes The test file is also checked in with the code: github.com/HFTrader/regex-performance/blob/master/3200.txtStegosaur
@jellyboy I looked it up, it was the unix regular expression "(.*?,){13}z" which a full match of strings of the type "1,2,3,4,5,6,7,8,9,10,11,12(,1)^i" which numbers of repetitions for the trailing ones. But according to my document C++ belong to the slow implementations, not the fast ones(like egrep).Aegisthus
@Aegisthus Boost threw an exception, std::regex took 16 SECONDS, re2 and rust about 20 milliseconds.Stegosaur
Can someone document where this table came from and the units? Seconds? Milliseconds?Pontone
L
25

Is that because of the C++ standard requirements or it is just that that particular implementation is not very well optimized?

The answer is yes. Kinda.

There is no question that libstdc++'s implementation of <regex> is not well optimized. But there is more to it than that. It's not that the standard requirements inhibit optimizations so much as the standard requirements inhibit changes.

The regex library is defined through a bunch of templates. This allows people to choose between char and wchar_t, which is in theory good. But there's a catch.

Template libraries are used by copy-and-pasting the code directly into the code compiling against those libraries. Because of how templates get included, even types that nobody outside of the template library knows about are effectively part of the library's ABI. If you change them, two libraries compiled against different versions of the standard library cannot work with each other. And because the template parameter for regex is its character type, those implementation details touch basically everything about the implementation.

The minute libstdc++ (and other standard library implementations) started shipping an implementation of C++ regular expressions, they bound themselves to a specific implementation that could not be changed in a way that impacted the ABI of the library. And while they could cause another ABI break to fix it, standard library implementers don't like breaking ABI because people don't upgrade to standard libraries that break their code.

When C++11 forbade basic_string copy-on-write implementations, libstdc++ had an ABI problem. Their COW string was widely used, and changing it would make code that compiled against the new one break when used with code compiled against the old one. It took years before libstdc++ bit the bullet and actually implemented C++11 strings.

If Regex had been defined without templates, implementations could use traditional mechanisms to hide implementation details. The ABI for the interface to external code could be fixed and unchanging, with only the implementation of the functions behind that ABI changing from version to version.

Lima answered 5/1, 2022 at 4:11 Comment(1)
Great answer. In theory, the std::basic_regex<char> specialization could have been explicit instantiated in the shared library, or could have been explicitly specialized to use a custom, non-inline implementation inside the shared library. That would have allowed some changes/improvements/optimizations to be made without breaking ABI. But that wasn't done, and it's too late to do it now, because now it would be an ABI break. We had to break the string ABI because everybody uses it. Going through that pain again for std::regex doesn't seem worth it - just use a better regex implementation.Boxhaul

© 2022 - 2024 — McMap. All rights reserved.