R: workaround for variable-width lookbehind

Given this vector:

ba <- c('baa','aba','abba','abbba','aaba','aabba')

I want to change the final a of each word to i, except in baa and aba.

I wrote the following line ...

gsub('(?<=a[ab]b{1,2})a','i',ba,perl=TRUE)

but was told: PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')a'.

I looked around a little and apparently R's Perl-compatible regex engine (PCRE) allows a variable-width lookahead but not a variable-width lookbehind. Is there a workaround for this problem? Thanks!

Bevus answered 27/3, 2015 at 19:4 Comment(0)

You can use the lookbehind alternative \K instead. This escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.

Quoted from rexegg:

The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before \K.

Using it in context:

sub('a[ab]b{1,2}\\Ka', 'i', ba, perl=TRUE)
# [1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"
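
To illustrate the quoted point that any quantifier may appear before \K, here is a minimal sketch that swaps the bounded b{1,2} for the open-ended b+ (my substitution, not part of the original pattern) and then inspects what the engine reports as the match. The variable names res_unbounded and reported are only illustrative.

```r
ba <- c('baa', 'aba', 'abba', 'abbba', 'aaba', 'aabba')

# b+ has no upper bound, yet it is accepted here because it sits
# before \K rather than inside a lookbehind.
res_unbounded <- sub('a[ab]b+\\Ka', 'i', ba, perl = TRUE)
print(res_unbounded)

# Extract the reported match for each element that matched:
# everything consumed before \K is left out of it.
reported <- regmatches(ba, regexpr('a[ab]b+\\Ka', ba, perl = TRUE))
print(reported)
```

For this vector the result should be the same as with b{1,2}, since no word has more than three consecutive b's, and the reported match should be just the trailing a.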

Avoiding lookarounds:

sub('(a[ab]b{1,2})a', '\\1i', ba)
# [1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"
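
As a quick check, the sketch below compares the two approaches above and adds an end-of-string anchor. The anchor and the extra word 'abbab' are my own assumptions for illustration: the question asks about the final a, and neither pattern above enforces that by itself.

```r
ba <- c('baa', 'aba', 'abba', 'abbba', 'aaba', 'aabba')

with_k       <- sub('a[ab]b{1,2}\\Ka', 'i', ba, perl = TRUE)  # PCRE with \K
with_capture <- sub('(a[ab]b{1,2})a', '\\1i', ba)             # default engine, backreference

# Both approaches should agree on this vector.
same_result <- identical(with_k, with_capture)
print(same_result)

# Hypothetical extra word whose matching a is not the last letter.
extra <- c(ba, 'abbab')
unanchored <- sub('(a[ab]b{1,2})a', '\\1i', extra)
anchored   <- sub('(a[ab]b{1,2})a$', '\\1i', extra)  # $ restricts the match to a final a
print(unanchored)
print(anchored)
```

The capture-group version has the advantage of not needing perl = TRUE; adding $ is only worth it if the input may contain words like the hypothetical one here.
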
Bournemouth answered 27/3, 2015 at 19:18 Comment(2)
Can I also ask if there is an equivalent of \\K in the other direction, i.e. resetting the end point of the reported match? - Bevus
Yes, \G if I follow what you are asking. - Bournemouth
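
A note on the follow-up, assuming it is about keeping trailing context out of the reported match: as far as I know, PCRE has no escape that resets the end of a match, and \G anchors a match at the position where the previous one ended. The usual tool for the end side is a lookahead, which may be variable width. A minimal sketch (the pattern is mine, chosen only to show an unbounded quantifier inside a lookahead):

```r
ba <- c('baa', 'aba', 'abba', 'abbba', 'aaba', 'aabba')

# Replace the first a that is followed by one or more b's and another a.
# The b's and the trailing a are checked but not consumed, so they
# stay out of the reported match and are left untouched.
res_ahead <- sub('a(?=b+a)', 'i', ba, perl = TRUE)
print(res_ahead)
# [1] "baa"   "iba"   "ibba"  "ibbba" "aiba"  "aibba"
```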

Another solution, which applies to this case only because the sole quantifier inside the lookbehind is a limiting (bounded) quantifier, is to use stringr::str_replace_all / stringr::str_replace:

> library(stringr)
> str_replace_all(ba, '(?<=a[ab]b{1,2})a', 'i')
[1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"

It works because stringr's regex functions are based on the ICU regex engine, which supports a constrained-width lookbehind:

The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)

So you can't use just any pattern inside an ICU lookbehind, but it is good to know that a limiting quantifier is allowed there when you need to match overlapping text within a known distance range.
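
To make the bounded versus unbounded distinction concrete, here is a small sketch, assuming stringr is installed. The b+ variant is my own addition, and the tryCatch wrapper is there only so the script keeps running if ICU rejects that pattern, as the quoted rule says it should.

```r
library(stringr)

ba <- c('baa', 'aba', 'abba', 'abbba', 'aaba', 'aabba')

# Bounded quantifier inside the lookbehind: accepted by ICU.
bounded <- str_replace_all(ba, '(?<=a[ab]b{1,2})a', 'i')
print(bounded)

# Unbounded quantifier (b+) inside the lookbehind: expected to fail,
# in which case the error message is returned instead of a result.
unbounded <- tryCatch(
  str_replace_all(ba, '(?<=a[ab]b+)a', 'i'),
  error = function(e) conditionMessage(e)
)
print(unbounded)
```

If the second call fails as expected, unbounded holds a single error message rather than six strings.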

Unsling answered 29/10, 2019 at 14:41 Comment(0)
