How many captured groups are supported by pcre2_substitute() function?

I am using the pcre2_substitute() function in my C++ project to perform a regex replace:

int ret=pcre2_substitute(
  re,                    /*Points to the compiled pattern*/
  subject,               /*Points to the subject string*/
  subject_length,        /*Length of the subject string*/
  0,                     /*Offset in the subject at which to start matching*/
  rplopts,               /*Option bits*/
  0,                     /*Points to a match data block, or is NULL*/
  0,                     /*Points to a match context, or is NULL*/
  replace,               /*Points to the replacement string*/
  replace_length,        /*Length of the replacement string*/
  output,                /*Points to the output buffer*/
  &outlengthptr          /*Points to the length of the output buffer*/
);

This is the man page of the function. It doesn't say how many capture groups are supported. I have tested that $01, ${6}, and $12 work, but what is the limit?

I checked whether there is a digit limit like the one in C++ std::regex, but there isn't. $000000000000001 works as $1, whereas std::regex would read only $00 as the group reference and treat the remaining digits as literal text.
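For comparison, the two-digit behaviour of std::regex can be seen with a small sketch. It uses only the standard library (no PCRE2), and the helper name std_regex_substitute is made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace every match of `pattern` in `subject`,
// using std::regex's default (ECMAScript) replacement syntax.
std::string std_regex_substitute(const std::string& subject,
                                 const std::string& pattern,
                                 const std::string& replacement) {
    return std::regex_replace(subject, std::regex(pattern), replacement);
}
```

Calling std_regex_substitute("abc", "(a)(b)(c)", "[$012]") gives "[a2]": $01 is read as group 1 and the trailing 2 is copied as literal text, which is the digit limit that pcre2_substitute() does not seem to have.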

The code I am using for testing is this one. You will need the PCRE2 library to run it.

Frissell answered 25/11, 2015 at 17:46 Comment(3)
Just a FYI. The title states backreferences, but backreferences are constructs in the regular expression that refer to captured data. On the replacement side, capture buffers are just variables.Gropius
Also, I noticed in your code (this one) that your regex contains 4 capture groups. Yet, you are trying to substitute $1234. That is capture group number 1,234 not capture group 1,2,3,4. For a real test, programmatically create a regex with about 10,000 capture groups. Set an appropriate subject string. Then try to do a substitution using $1234.Gropius
@sln I had to test various scenarios; the example code is just one of them. I have tested $1111 with more than 1111 capture groups and it gave the correct result. Anyway, the question is solved.Frissell

The maximum number of capturing groups is 65,535. This is also the highest group number that can be referenced, whether as a backreference in the pattern or as a group reference in the replacement.

However, generally speaking, a match will probably hit another limit before it gets anywhere near that many groups: e.g. the maximum length of the subject string, or the number of times match() is called internally (in total, or recursively), though match limits can be increased. For detailed information about match limits, see "The match context" in pcre2api.
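To try a high group number yourself, the inputs can be generated rather than typed. The following is only a sketch of that generation step: GroupTest and make_group_test are hypothetical names, and the resulting strings still have to be passed to pcre2_compile() and pcre2_substitute() in your own program. Whether a very large pattern compiles at all depends on how the library was built.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Limit quoted from the pcre2limits man page.
constexpr std::size_t kMaxCaptureGroups = 65535;

// Hypothetical container for the strings a substitution test needs.
struct GroupTest {
    std::string pattern;      // "(.)" repeated once per group
    std::string subject;      // 'a'..'z' cycled, one character per group
    std::string replacement;  // "${target}"
    std::string expected;     // the character captured by group `target`
};

GroupTest make_group_test(std::size_t groups, std::size_t target) {
    if (groups == 0 || groups > kMaxCaptureGroups)
        throw std::invalid_argument("group count out of range");
    if (target == 0 || target > groups)
        throw std::invalid_argument("target group out of range");

    GroupTest t;
    t.pattern.reserve(groups * 3);
    t.subject.reserve(groups);
    for (std::size_t i = 0; i < groups; ++i) {
        t.pattern += "(.)";                            // group number i + 1
        t.subject += static_cast<char>('a' + i % 26);  // text that group captures
    }
    t.replacement = "${" + std::to_string(target) + "}";
    t.expected = std::string(1, t.subject[target - 1]);
    return t;
}
```

Because the pattern matches the whole subject, a single (non-global) substitution should leave just the one character stored in expected. For example, make_group_test(1234, 1234) produces the replacement ${1234}, which refers to group 1,234 rather than groups 1, 2, 3 and 4.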


From the pcre2limits man page:

There is no limit to the number of parenthesized subpatterns, but there can be no more than 65,535 capturing subpatterns.

There is, however, a limit to the depth of nesting of parenthesized subpatterns of all kinds. This is imposed in order to limit the amount of system stack used at compile time. The limit can be specified when PCRE2 is built; the default is 250.

and

The maximum number of named subpatterns is 10,000.

By Philip Hazel. Last updated: 25 November 2014. - *As of PCRE2 version 10.20


Size limitations in PCRE and PCRE2

PCRE and PCRE2 have the same limits:

  • All values in repeating quantifiers are limited to 65,535.

  • There is no limit to the number of parenthesized subpatterns
    (only the depth of nesting of parenthesized subpatterns of all kinds is limited).

  • 65,535 capturing subpatterns.

  • 10,000 named subpatterns.

  • The default maximum depth of nested parentheses is 250
    (value of PCRE2_CONFIG_PARENSLIMIT).

  • The maximum length of the name of a named subpattern is 32 code units.
    A character is represented by one or more code units, depending on the encoding. E.g. in UTF-8 "Ç" takes 2 code units: 0xC3 0x87

  • There is no limit to the number of backward references.

  • The limit to the number of forward references to subsequent subpatterns is around 200,000.

  • Names used in control verbs are limited to 255 code units (8-bit library) or 65,535 code units (16-bit and 32-bit libraries).

  • The default value for PCRE2_CONFIG_MATCHLIMIT is 10,000,000 (10 million).

  • The default value for PCRE2_CONFIG_RECURSIONLIMIT is 10,000,000 (10 million).
    (This limit only applies if it is set smaller than the match limit.)

  • The maximum length of a compiled pattern is 64K code units if compiled with the default internal linkage size of 2 (see the pcre2build documentation for details).

  • In PCRE, the maximum length of a subject string is the largest positive number that an integer variable can hold.
    In PCRE2, the maximum length (in code units) of a subject string is one less than the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned integer type, usually defined as size_t, so the ceiling may be ~1.8E+19 on a 64-bit platform. However, the available stack space may limit the size of a subject string that can be processed by certain patterns.
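A couple of these limits are stated in code units rather than characters, which is easy to get wrong. Here is a minimal sketch of what that means for UTF-8 input. All names in it are hypothetical, it does not call PCRE2, and it assumes PCRE2_SIZE is size_t, as the last bullet says is usual:

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Value copied from the list above.
constexpr std::size_t kMaxGroupNameCodeUnits = 32;

// In a UTF-8 encoded std::string every char is one 8-bit code unit,
// so size() counts code units, not characters.
std::size_t utf8_code_units(const std::string& s) {
    return s.size();
}

// Checks only the length limit, not which characters a name may contain.
bool group_name_length_ok(const std::string& utf8_name) {
    return !utf8_name.empty() &&
           utf8_code_units(utf8_name) <= kMaxGroupNameCodeUnits;
}

// Assumption: PCRE2_SIZE is size_t. The limit is one less than its maximum.
constexpr std::size_t max_subject_length() {
    return std::numeric_limits<std::size_t>::max() - 1;
}
```

With this, utf8_code_units("\xC3\x87") (the letter Ç) is 2, so a name made only of such two-unit characters reaches the 32 code unit limit after 16 characters.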

Transept answered 25/11, 2015 at 23:41 Comment(1)
@sln Good point. Added a brief example and reference the first time it's mentioned.Transept

PCRE2 package

According to pcre2-10.20\README, PCRE2 has a counter that limits the depth of nesting of parentheses in a pattern. This limits the amount of system stack that a pattern uses when it is compiled. The default is 250, but you can change it by setting, for example, --with-parens-nest-limit=500.

PCRE2 also has a counter that can be set to limit the amount of resources it uses when matching a pattern. If the limit is exceeded during a match, the match fails. The default is ten million. You can change the default by setting, for example, --with-match-limit=500000.

So, it seems that the number of backreferences

  • is NOT hardcoded into PCRE2
  • is most probably dependent on the size of the match limit or match limit recursion parameters.

Since you are building the library yourself, you can increase that even further.

PCRE Online reference

From regular-expressions.info:

Most regex flavors support up to 99 capturing groups and double-digit backreferences. So \99 is a valid backreference if your regex has 99 capturing groups.

And Regular Expression Reference: Capturing Groups and Backreferences page:

Backreference \1 through \9
Backreference \10 through \99

Note that by "most" the author must be referring to the major regex engines (PHP, JavaScript, Python, .NET), as opposed to the POSIX BRE, POSIX ERE, GNU BRE, and GNU ERE regex flavors, which only support backreferences up to \9.

However, pcre.txt contains this line:

\ddd     character with octal code ddd, or backreference

So, it is possible to have 999 groups according to this document.

Coincidental answered 25/11, 2015 at 17:59 Comment(7)
${112} and ${1111} worked; maybe $9999 (four 9s) works too. I am starting to think that there is no limit...Frissell
Could you please let me know how you test? I have just tried to use '${100}' in the replacement pattern, and it returned as a literal ${100}.Unreconstructed
I think so. PHP uses PCRE, and the maximum size of a pattern in PCRE depends on the configuration it was built with. By default that's a match limit of 10 million, a recursion limit of 10 million, and 250 nested parentheses. It's literally unlimited. :DFrissell
I am doing it with a C++ program using the original PCRE (PCRE2) library. The code is pretty long, though I can add it to the post if you want.Frissell
Well, it would be great. I will try to run the tests on my end, too.Unreconstructed
Done, added a link to rextester. You will need to link it with the pcre2 library to build it.Frissell
Note that the replacement limit in PHP is due to PHP's implementation of preg_replace (preg_get_backref specifically), not a limitation of PCRE. Anyway, the part about PHP derails from the question, which asks about the limit in the PCRE library. I suggest removing it altogether and posting it as a separate Q&A.Pedicle