PHP preg_match_all limit

I'm using preg_match_all with a very long pattern.

When I run the code, I get this error:

Warning: preg_match_all(): Compilation failed: regular expression is too large at offset 707830

After searching, I found a suggested solution: increase the values of pcre.backtrack_limit and pcre.recursion_limit in php.ini.

But after increasing the values and restarting Apache, I still get the same error. My PHP version is 5.3.8.

Slider answered 25/11, 2011 at 11:43 Comment(1)
Please post the regular expression that you are using. – Hooghly
N
7

Increasing the PCRE backtrack and recursion limits may fix the problem, but it will fail again as soon as the size of your data hits the new limits, so it doesn't scale well with more data.

Example:

<?php 
// essential for huge PCREs
ini_set("pcre.backtrack_limit", "23001337");
ini_set("pcre.recursion_limit", "23001337");
// imagine your PCRE here...
?>

To really solve the underlying problem, you must optimize your expression and, if possible, split the complex expression into parts and move some of the logic to PHP. Instead of trying to find the sub-structure directly with a single PCRE, the example below takes a more iterative approach, going deeper and deeper into the structure using PHP:

<?php
$html = file_get_contents("huge_input.html");

// first find all tables, and work on those later
$res = preg_match_all("!<table.*>(?P<content>.*)</table>!isU", $html, $table_matches);

if ($res) foreach($table_matches['content'] as $table_match) {  

    // now find all cells in each table that was found earlier ..
    $res = preg_match_all("!<td.*>(?P<content>.*)</td>!isU", $table_match, $cell_matches);

    if ($res) foreach($cell_matches['content'] as $cell_match) {

        // imagine going deeper and deeper into the structure here...
        echo "found a table cell! content: ", $cell_match;

    }    
}
Neology answered 25/11, 2011 at 12:36 Comment(1)
Actually, in my case the pattern itself is very long: it is a list of blocked sites separated by |, for example sex.com|porn.com|bad.com. Your solution seems good. After I split the pattern into smaller parts, it works well :) Thanks Kaii. – Slider
A
12

That error is not about the performance of the regex, it's about the regex itself. Changing the pcre.backtrack_limit and pcre.recursion_limit isn't going to have any effect because the regex never gets a chance to run. The problem is that the regex is too big, and the solution is to make the regex smaller--much, much smaller.
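
As a rough sketch of what "smaller" can look like for a long alternation such as a list of blocked sites: split the list into chunks and compile one modest pattern per chunk. The helper name is_blocked, the chunk size of 500 and the host-suffix matching rule are assumptions made for illustration, not something taken from the question, and the syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: returns true if $host equals, or is a subdomain of,
// any entry in $blocked. The list is split into chunks so that no single
// compiled pattern becomes too large.
function is_blocked(string $host, array $blocked, int $chunkSize = 500): bool
{
    foreach (array_chunk($blocked, $chunkSize) as $chunk) {
        // Escape every entry so dots and other metacharacters are literal.
        $quoted = array_map(function ($domain) {
            return preg_quote($domain, '/');
        }, $chunk);

        // Match the entry at the start of the host or right after a dot.
        $pattern = '/(?:^|\.)(?:' . implode('|', $quoted) . ')$/i';

        $res = preg_match($pattern, $host);
        if ($res === false) {
            // false means the pattern could not be compiled or executed.
            throw new RuntimeException('preg_match failed for a chunk');
        }
        if ($res === 1) {
            return true;
        }
    }
    return false;
}
```

If a chunk still fails to compile, lowering the chunk size is the knob to turn; the right value depends on the entries and on the PCRE build in use.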

Antlion answered 25/11, 2011 at 12:41 Comment(0)
K
4

I'm writing this answer because I stumbled across the same problem. As Alan Moore pointed out, adjusting the backtrack and recursion limits won't help to solve the problem.

The described error occurs when a needle (the pattern) exceeds the largest possible needle size, which is limited by the underlying PCRE library. The error is NOT caused by PHP, but by that library. It is error message #20, which is defined here:

https://github.com/php/.../pcre_compile.c#L477

PHP just prints the error text it receives from the PCRE library on failure.
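
To see that behaviour from inside a script, one option is to capture the warning and check the return value: preg_match returns false when the pattern cannot be used, as opposed to 0 for "no match". The sketch below uses a deliberately broken pattern as a stand-in for an oversized one, because the size at which the "too large" error triggers depends on the PCRE build. The helper name try_match is hypothetical, and the syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: runs preg_match and returns [result, warning text].
// result is 1 (match), 0 (no match) or false (the pattern failed).
function try_match(string $pattern, string $subject): array
{
    $warning = null;

    // Temporarily capture E_WARNING messages instead of printing them.
    set_error_handler(function (int $errno, string $errstr) use (&$warning): bool {
        $warning = $errstr;
        return true;
    }, E_WARNING);

    try {
        $result = preg_match($pattern, $subject);
    } finally {
        restore_error_handler();
    }

    return [$result, $warning];
}

// An unbalanced parenthesis fails at compile time, like an oversized pattern.
list($result, $warning) = try_match('/(/', 'abc');
var_dump($result === false);
echo $warning, "\n";
```

Checking for false with === matters here, because a plain if (!preg_match(...)) treats a pattern that failed to compile the same way as a pattern that simply did not match.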

In my environment, this error appears when I use previously captured fragments as a needle and they are bigger than 32k bytes.

It can easily be tested by running this simple script from PHP's CLI:

<?php
// This script demonstrates the above error: it prints the length of each
// needle it tries and stops once the accumulated chunk exceeds 64 kbytes.

$expand=$needle="_^b_";
while( ! preg_match( $needle, "Stack Exchange Demo Text" ) )
{
    // Die after 64 kbytes of accumulated chunk needle
    // Adjust to 32k for a better illustration
    if ( strlen($expand) > 1024*64 ) die();

    if ( $expand == "_^b_" ) $expand = "";
    $expand .= "a";
    // Build the needle from the accumulated chunk, not from the previous needle
    $needle = '_^'.$expand.'_ism';

    echo strlen($needle)."\n";

}
?>

To fix the error, either the resulting needle must be made smaller or, if everything needs to be matched, multiple preg_match calls with the additional offset parameter must be used.

<?php
    if ( 
        preg_match( 
            '/'.preg_quote( 
                    substr( $big_chunk, 0, 20*1024 ) // 1st 20k chars
                ) 
                .'.*?'. 
                preg_quote( 
                    substr( $big_chunk, -5 ) // last 5
                ) 
            .'/', 
            $subject 
        ) 
    ) { 
        // do stuff
    }

    // Attempt to match all needles in the text, one chunk per call
    if ( preg_match( 
            $needle_of_1st_32kbytes_chunk, 
            $subj, $matches, $flags = 0, 
            $offset = 32*1024*0 // Offset -> 0
        )
        && preg_match( 
            $needle_of_2nd_32kbytes_chunk, 
            $subj, $matches, $flags = 0, 
            $offset = 32*1024*1 // Offset -> 32k
        )
        // && ... as many preg matches as needed
    ) {
        // do stuff
    }

    // it would be nicer to put the needles in a foreach loop iterating
    // over the existing chunks
?>

You get the idea.
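
A possible shape for that loop, as a sketch only: it assumes the needle is a literal string that should appear in the subject at a known byte position, and that each piece may be anchored with \G at the current offset. The helper name matches_in_pieces and the 8192-byte piece size are hypothetical, and the syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: checks whether the literal $needle occurs in $subject
// starting at byte $start, using one small pattern per piece of the needle.
function matches_in_pieces(string $needle, string $subject, int $start = 0, int $pieceSize = 8192): bool
{
    $offset = $start;

    foreach (str_split($needle, $pieceSize) as $piece) {
        // \G anchors the match at $offset, so the pieces must be contiguous.
        $pattern = '/\G' . preg_quote($piece, '/') . '/';

        $res = preg_match($pattern, $subject, $matches, 0, $offset);
        if ($res !== 1) {
            // 0 means this piece did not match, false means the pattern failed.
            return false;
        }

        // Continue right after the piece that was just matched.
        $offset += strlen($piece);
    }

    return true;
}
```

Note that preg_quote can make a piece longer than its raw length, so the piece size should stay well below whatever limit the environment shows. For a purely literal needle, strpos would do the same job without any regex; the loop is mainly useful when the pieces are combined with real pattern syntax.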

Although this answer is rather late, I hope it still helps people who run into this problem without a good explanation of why the error occurs.

Kennethkennett answered 24/2, 2016 at 14:14 Comment(0)
