I'm adding this second answer in response to the new information you added since I posted the first one. My goal there was to help you restore your system to its previous state, when the regexes were working. I tend to agree with the commenter on that page I linked to, who said the default setting was overly conservative. So I stand by that answer, but I don't want anyone to think they can solve all their regex problems by throwing more memory at them.
Now that I've seen your real-world regexes, I have to say you have another problem. I tested that third regex against the page you linked to in RegexBuddy and these are the results I got:
(?ims)<tr.*?>.*?<b>.*?first science area of study.*?</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>
course name start end steps
Match #1 (Comp. Sci.) 10 275 31271
Match #2 (Bio & Chem) 276 341 6986
Match #3 (Enviro) 342 379 5944
Match #4 (Genetics) 386 416 4463
Match #5 (Chem) 417 455 5074
Match #6 (Math) 495 546 15610
Match #7 (Phys & Astro) 547 593 8617
Match #8 (no match) gave up after 1,000,000 steps
You've probably heard many people say that non-greedy regexes always return the shortest possible match, so why did this one return a first match that's 200 lines longer than any of the others? You may have heard that they're more efficient because they don't backtrack as much, so why did this one take over 30,000 steps to complete the first match, and why did it effectively lock up on the final attempt when no match was possible?
First off, there's no such thing as a greedy or non-greedy regex. Only individual quantifiers can be described that way. A regex in which every quantifier is greedy won't necessarily return the longest possible match, and the name "non-greedy regex" is even less accurate. Greedy or non-greedy, the regex engine always starts trying to match at the earliest opportunity, and it doesn't give up on a starting position until every possible path from it has been explored.
Non-greedy quantifiers are only a convenience; there's nothing magical about them. It's still up to you, the regex author, to guide the regex engine to a correct and efficient match. Your regex may returning the correct results, but it's wasting a hell of a lot of effort in the process. It consumes a lot of characters that it doesn't need to at the beginning, it thrashes about endlessly examining the same characters over and over, and it takes way too long to figure out when the path it's on can't lead to a match.
Now check out this regex:
(?is)<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>\s*first science area of study\s*</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>
course name start end steps
Match #1 (Comp. Sci.) 209 275 9891
Match #2 (Bio & Chem) 276 341 5389
Match #3 (Enviro) 342 379 5833
Match #4 (Genetics) 386 416 4222
Match #5 (Chem) 417 455 4961
Match #6 (Math) 495 546 9899
Match #7 (Phys & Astro) 547 593 8506
Match #8 (no match) reported failure in 139 steps
After the first </b>
, everything is as you wrote it. The effect of my changes is that it doesn't start matching in earnest until it finds the <TR>
element that contains the first <B>
tag we're interested in:
<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>\s*first science area of study\s*</b>
This part spends most of its time greedily consuming characters with [^<]*
, which is significantly faster, character for character, than a non-greedy .*?
. But far more important is that it takes practically no time to figure out when no more matches are possible. If there's a Golden Rule of regex performance it's this: when a match attempt is going to fail, it should fail as quickly as possible.
i
in there at the end after the delimiter... that makes it case insensitive – Themselvess
make.
match new lines, etc. as well as spaces? (Andm
makes it multi-line too). Correct me if I'm wrong... – Themselvesfalse
. Are you sure that you copied the exact string? anything about encoding? – Abrasive