Odd Behavior with Greedy Modifiers Inside Capture Groups

Consider the following commands:

text <- "abcdEEEEfg"

sub("c.+?E", "###", text)
# [1] "ab###EEEfg"                          <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg"                           <<< WEIRD
sub("c(.+?)E", "###", text, perl = TRUE)
# [1] "ab###EEEfg"                          <<< OKAY

The first does exactly what I expect, matching only up to the first E. The second should be essentially identical to the first, since all I'm doing is adding a capturing group (which I'm not even using), yet for some reason it consumes an extra E. That said, it isn't fully greedy either (if it were, it would have consumed all the Es). Even weirder, it actually still matches the pattern, even though the sub result suggests the .+? piece left out EE, which can no longer be matched by the rest of the regular expression. This suggests there is an offset issue when computing the length of the matched sub-expression, rather than in the actual matching.

The final one is exactly the same but run with PCRE, and that works as expected.
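One way to check the offset hypothesis is to look at the reported match position and length directly instead of going through sub. The following is only a minimal sketch, assuming base R's regexpr; the helper name inspect is made up for illustration. The PCRE values are what a lazy match should give (the match "cdE"), while the TRE values are deliberately left unstated, since they depend on the TRE version bundled with R.

```r
text <- "abcdEEEEfg"
patterns <- c(plain = "c.+?E", grouped = "c(.+?)E")

# Hypothetical helper: report where the match starts and how long it is.
inspect <- function(pattern, perl) {
  m <- regexpr(pattern, text, perl = perl)
  c(start = as.integer(m), length = attr(m, "match.length"))
}

# One column per pattern, rows "start" and "length".
tre  <- sapply(patterns, inspect, perl = FALSE)  # default engine (TRE)
pcre <- sapply(patterns, inspect, perl = TRUE)   # PCRE: start 3, length 3 for both

print(tre)
print(pcre)
```

If the grouped column under TRE shows a longer length than the plain column while the start is the same, that would be consistent with the problem being in the reported match length.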

Am I missing something or is this behavior undocumented/buggy?

Adorno answered 26/2, 2014 at 23:45 Comment(2)
This smells like a bug in R. - Apothecary
Posted as a bug on the tre github page. - Adorno

R uses libtre, version 0.8. For more stability, you should always use perl = TRUE.

Note that

sub("c(.+?)E?", "###", text)

works.
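If the contents of the capture group are needed rather than just a substitution, a sketch along the following lines stays on the PCRE engine. It assumes base R's regexpr with perl = TRUE, which attaches capture.start and capture.length attributes to its result; the variable names are illustrative only.

```r
text <- "abcdEEEEfg"

# Match with PCRE; the result carries per-group offsets as attributes.
m <- regexpr("c(.+?)E", text, perl = TRUE)

# Whole match.
whole <- regmatches(text, m)

# First capture group, cut out of the input by its start and length.
cap_start <- attr(m, "capture.start")[1, 1]
cap_len   <- attr(m, "capture.length")[1, 1]
group1    <- substring(text, cap_start, cap_start + cap_len - 1L)

print(whole)   # "cdE"
print(group1)  # "d"
```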

Dayna answered 27/2, 2014 at 2:16 Comment(4)
This is what I've always done, but some things are not implemented with the perl = T flag (regexec in particular). My actual bug came up while trying to use regexec (or more specifically, the str_match_all/etc. tools in stringr that rely on it), and I was similarly able to work around it by adding .* after the pattern, though for the sub example that obviously doesn't work. If no one else has more info by the morning I'll take this as the answer. Do you know if there are any plans to update the library? Looks like 0.8 has been around for 4 years. - Adorno
Actually, looks like the TRE library has already been updated (search for TRE). - Adorno
I fixed my answer to reflect the update. It doesn't look like development is continuing on libtre. There are several open issues, one of which is about R. I think this should be raised as a bug to the R development team. - Dayna
I submitted this to R and got sent packing, with the suggestion that I submit it to TRE instead. I submitted it to laurikari as well, though I suspect you're right that it is the same issue you link. - Adorno
