Consider the following commands:
text <- "abcdEEEEfg"
sub("c.+?E", "###", text)
# [1] "ab###EEEfg" <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg" <<< WEIRD
sub("c(.+?)E", "###", text, perl=T)
# [1] "ab###EEEfg" <<< OKAY
The first does exactly what I expect, basically matching just the first E. The second one should essentially be identical to the first, since all I'm doing is adding a capturing group (though I'm not using it), yet for some reason it captures an extra E. That said, it isn't fully greedy (i.e. if it was it would have captured all the Es). Even weirder, it actually still matches the pattern, even though the sub
result suggests the .+?
piece left out EE
, which can no longer be matched by the rest of the regular expression. This suggests there is an offset issue when computing the length of the matched sub-expression, rather than in the actual matching.
The final one is exactly the same but run with PCRE, and that works as expected.
Am I missing something or is this behavior undocumented/buggy?