Why would this regex return an error?

Why does the following evaluate to true?

if(preg_match_all('%<tr.*?>.*?<b>.*?</b>.*?</tr>%ims', $contents, $x)===FALSE)
{...}

$contents is retrieved using file_get_contents() from this source.


The regex was simplified to troubleshoot the problem. The code I was actually using was:

if(preg_match(
           '%Areas of Study: </P>.*?<TABLE BORDER="0">(.*?)<TBODY>.*?</TBODY>.*?   </TABLE>%ims',
            $contents, $course_list)
    )
    {
        if(preg_match_all('%<TR>.*?<TD.*?>.*?<B>(.*?)</B>.*?</TD>.*?<TD.*?>.*?</TD>.*?<TD.*?>.*?<B>(.*?)</B>.*?</TD>.*?</TR>%ims',
                $course_list[0], $course_titles)
        )
        {
            ...
        }
        else 
        {
            die('<p>ERROR: first preg_match_all fails</p>');
        }

        echo '<p>INFO: Courses  found</p>';
    }
    else
    {
        die('<p>ERROR: Courses not found</p>');
    }

    if(
        preg_match_all('%<tr.*?>.*?<b>.*?first '.$college.' area of study.*?</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>%ims',
        $contents, $course_modules))
    {
        ....
    }
    else
    {
        die('<p>ERROR: Courses details/streams not found</p>');
    }

I always get:

INFO: Courses found
ERROR: Courses details/streams not found

It's strange how the other regex function calls seem to work but not the last one.


Note:

This regex previously worked (it was actually more complex). I'm not sure if this matters, but I updated my version of WAMP (so my php.ini, etc. was reset) and I messed around with my setup while troubleshooting a MongoDB connection problem last week.

Themselves answered 2/2, 2012 at 23:32 Comment(12)
Because you are using it on HTML: #1732848. Even if you find a clever solution this time, it will fail tomorrow when you update your ___. FWIW, could it be related to case sensitivity? - Lettered
@Lettered I took an entire module on parsing HTML with regex :O The head of our CS dept taught the module too! Does it matter that the source isn't xHTML? Source: ucc.ie/calendar/science/sci002.html - Themselves
@Lettered I have i in there at the end after the delimiter... that makes it case insensitive - Themselves
Looks like it should work to me. Just a shot in the dark, but maybe try changing your delimiters (# instead of %). - Summer
@AdamLynch I'd direct your CS dept head to the linked answer and subsequent blog post from Jeff Atwood - Nazarius
@Nazarius It's too late for me... think of the children? I want to go back in time and save myself! - Themselves
Bonus points to anyone who can explain this in @Lettered's first comment: "Even if you find a clever solution this time, it will fail tomorrow when you update your ___" Even better, explain why it's a sequence of underscores and not words! - Themselves
You're not matching new lines, could this be the reason? - Deafening
@Deafening doesn't s make . match new lines, etc. as well as spaces? (And m makes it multi-line too). Correct me if I'm wrong... - Themselves
Tested locally - regexp shows match and returns true, making the whole condition false. Are you sure that you copied the exact string? Anything about encoding? - Abrasive
This time it failed when you updated your WAMP, next time it could be when you update HTML or something else - the blank can be filled with many things. Parsing HTML with regex is prone to fail easily - it is hard to maintain - Lettered
In an attempt to change the course for the better (for the kids), I've posted a question to get some proof to back up the claim that HTML parsing should not be done with RegEx. - Themselves
A
1

You might check your pcre.backtrack_limit setting. It would have to be ridiculously low to prevent that regex from matching that input, but you did say you'd been messing around with the setup...
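One way to confirm or rule this out is to ask PHP why the call failed. The sketch below is only an illustration: preg_match_all() returns FALSE when the engine gives up with an error (as opposed to 0 for "no matches"), and preg_last_error() reports the reason. The helper name explain_preg_failure() and the sample $html string are made up here and are not part of the question.

```php
<?php
// Hypothetical helper: translate preg_last_error() into a readable reason.
function explain_preg_failure(): string
{
    $reasons = [
        PREG_NO_ERROR              => 'no error',
        PREG_INTERNAL_ERROR        => 'internal PCRE error',
        PREG_BACKTRACK_LIMIT_ERROR => 'pcre.backtrack_limit exhausted',
        PREG_RECURSION_LIMIT_ERROR => 'pcre.recursion_limit exhausted',
        PREG_BAD_UTF8_ERROR        => 'malformed UTF-8 in subject',
    ];
    $code = preg_last_error();
    return $reasons[$code] ?? "PCRE error code $code";
}

// Show the limit currently in effect.
echo 'pcre.backtrack_limit = ', ini_get('pcre.backtrack_limit'), PHP_EOL;

// Stand-in for the downloaded page (an assumption, not the real source).
$html = '<tr><td><b>Title</b></td></tr>';

$count = preg_match_all('%<tr.*?>.*?<b>.*?</b>.*?</tr>%ims', $html, $x);
if ($count === false) {
    // FALSE means the engine gave up; it does not mean "zero matches".
    echo 'preg_match_all failed: ', explain_preg_failure(), PHP_EOL;
} else {
    echo "matched $count row(s)", PHP_EOL;
}
```

If the failure branch reports the backtrack limit on the real page, the setting is the culprit rather than the pattern syntax.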

You can try testing it by changing the regex. When I tested it in RegexBuddy, your regex matched that input in 1216 steps. When I changed it to this:

'%<tr.*?>.*?<b>.*?</b>[^<]*(?:<(?!/?tr\b)[^<]*)*</tr>%ims'

...it only took 441 steps.

Apprehend answered 3/2, 2012 at 7:47 Comment(2)
That regex also returned false. I found the ;pcre.backtrack_limit=100000 in my php.ini and changed it to pcre.backtrack_limit=100000 but it made no difference either - Themselves
OK, it turns out uncommenting the backtrack_limit line and adding three zeros sorted it; i.e. pcre.backtrack_limit=100000000 - Themselves
A
3

I'm adding this second answer in response to the new information you added since I posted the first one. My goal there was to help you restore your system to its previous state, when the regexes were working. I tend to agree with the commenter on that page I linked to, who said the default setting was overly conservative. So I stand by that answer, but I don't want anyone to think they can solve all their regex problems by throwing more memory at them.

Now that I've seen your real-world regexes, I have to say you have another problem. I tested that third regex against the page you linked to in RegexBuddy and these are the results I got:

(?ims)<tr.*?>.*?<b>.*?first science area of study.*?</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>

          course name       start      end    steps
Match #1  (Comp. Sci.)        10       275    31271
Match #2  (Bio & Chem)       276       341     6986
Match #3  (Enviro)           342       379     5944
Match #4  (Genetics)         386       416     4463
Match #5  (Chem)             417       455     5074
Match #6  (Math)             495       546    15610
Match #7  (Phys & Astro)     547       593     8617
Match #8  (no match)        gave up after 1,000,000 steps

You've probably heard many people say that non-greedy regexes always return the shortest possible match, so why did this one return a first match that's 200 lines longer than any of the others? You may have heard that they're more efficient because they don't backtrack as much, so why did this one take over 30,000 steps to complete the first match, and why did it effectively lock up on the final attempt when no match was possible?

First off, there's no such thing as a greedy or non-greedy regex. Only individual quantifiers can be described that way. A regex in which every quantifier is greedy won't necessarily return the longest possible match, and the name "non-greedy regex" is even less accurate. Greedy or non-greedy, the regex engine always starts trying to match at the earliest opportunity, and it doesn't give up on a starting position until every possible path from it has been explored.
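The starting-position rule is easy to see on a toy input. This is a minimal sketch, assuming a made-up two-row $html string and a made-up marker word "target" (neither comes from the question). The first pattern is a cut-down version of the all-lazy style; the second borrows the lookahead prefix from the rewritten regex later in this answer.

```php
<?php
// Made-up input: the first row has no <b>, the second row has the one we want.
$html = '<tr><td>intro</td></tr><tr><td><b>target</b></td></tr>';

// Every quantifier is lazy, yet the attempt that starts at the first <tr>
// is allowed to succeed by stretching .*? across the row boundary.
preg_match('%<tr.*?>.*?<b>target</b>.*?</tr>%is', $html, $lazy);

// Here the prefix may not step over </tr> or <b>, so the attempt at the
// first <tr> fails and the match starts at the row that holds the <b>.
preg_match('%<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>target</b>.*?</tr>%is', $html, $guarded);

echo strlen($lazy[0]), ' chars: ', $lazy[0], PHP_EOL;
echo strlen($guarded[0]), ' chars: ', $guarded[0], PHP_EOL;
```

The lazy pattern returns both rows as one match, while the guarded one returns only the second row.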

Non-greedy quantifiers are only a convenience; there's nothing magical about them. It's still up to you, the regex author, to guide the regex engine to a correct and efficient match. Your regex may be returning the correct results, but it's wasting a hell of a lot of effort in the process. It consumes a lot of characters that it doesn't need to at the beginning, it thrashes about endlessly examining the same characters over and over, and it takes way too long to figure out when the path it's on can't lead to a match.

Now check out this regex:

(?is)<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>\s*first science area of study\s*</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>

          course name       start      end    steps
Match #1  (Comp. Sci.)       209       275     9891
Match #2  (Bio & Chem)       276       341     5389
Match #3  (Enviro)           342       379     5833
Match #4  (Genetics)         386       416     4222
Match #5  (Chem)             417       455     4961
Match #6  (Math)             495       546     9899
Match #7  (Phys & Astro)     547       593     8506
Match #8  (no match)        reported failure in 139 steps

After the first </b>, everything is as you wrote it. The effect of my changes is that it doesn't start matching in earnest until it finds the <TR> element that contains the first <B> tag we're interested in:

<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>\s*first science area of study\s*</b>

This part spends most of its time greedily consuming characters with [^<]*, which is significantly faster, character for character, than a non-greedy .*?. But far more important is that it takes practically no time to figure out when no more matches are possible. If there's a Golden Rule of regex performance it's this: when a match attempt is going to fail, it should fail as quickly as possible.
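To see the fail-fast difference from PHP itself, something like the following rough sketch can be used. The input is made up (200 rows with no <b> tag at all), the marker "needle" is invented, and the limit of 100000 is simply the value that appeared commented out in the asker's php.ini. Turning the JIT off is my own assumption to keep the comparison simple, since the limit is applied differently when the JIT is in use.

```php
<?php
// Assumption: 100000 is the value shown commented out in the asker's php.ini.
ini_set('pcre.backtrack_limit', '100000');
// Assumption: disable the JIT so the limit is counted by the interpreter.
ini_set('pcre.jit', '0');

// Made-up input: 200 rows, none of which contains a <b> tag.
$rows = str_repeat('<tr><td>x</td></tr>', 200);

// Lazy prefix: a failing attempt rescans the rest of the input again and again.
$lazyResult = preg_match('%<tr.*?>.*?<b>needle</b>%is', $rows);
$lazyError  = preg_last_error();

// Guarded prefix: each attempt stops at the first </tr> and fails right there.
$guardedResult = preg_match('%<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>needle</b>%is', $rows);
$guardedError  = preg_last_error();

// FALSE plus a backtrack-limit error is the "gave up" case; 0 is a clean "no match".
var_dump($lazyResult, $lazyError === PREG_BACKTRACK_LIMIT_ERROR);
var_dump($guardedResult, $guardedError === PREG_NO_ERROR);
```

On a setup like the one assumed here, the lazy pattern should give up with an error while the guarded one reports a plain 0, which is the same contrast as the last row of the two tables above.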

Apprehend answered 7/2, 2012 at 7:43 Comment(0)