stringr, str_extract: how to do positive lookbehind?

Asked 4/3, 2016 at 19:21 Answered 9/11, 2021 at 23:59

Very simple problem. I just need to capture some strings using a regex positive lookbehind, but I don't see a way to do it.

Here's an example. Suppose I have some strings:

library(stringr)
myStrings <- c("MFG: acme", "something else", "MFG: initech")

I want to extract the words that are prefixed with "MFG:".

> result_1  <- str_extract(myStrings,"MFG\\s*:\\s*\\w+")
>
> result_1
[1] "MFG: acme"    NA             "MFG: initech"

That almost does it, but I don't want to include the "MFG:" part, so that's what a "positive lookbehind" is for:

> result_2  <- str_extract(myStrings,"(?<=MFG\\s*:\\s*)\\w+")
Error in stri_extract_first_regex(string, pattern, opts_regex = attr(pattern,  : 
  Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)
>

It is complaining about needing a "bounded maximum length", but I don't see where to specify that. How do I make positive-lookbehind work? Where, exactly, can I specify this "bounded maximum length"?

Pearle answered 4/3, 2016 at 19:21 Comment(2)

Aha! Stringr's regex requires a limitation on the lookbehind! – Pearle 4/3, 2016 at 19:33

Mostly (except for as laid out by Wiktor in the answer below) lookarounds are fixed-width, so you can't use quantifiers. – Alina 4/3, 2016 at 19:38

You need to use str_match, since the prefix you would put in the lookbehind contains unbounded quantifiers (\\s*): the number of whitespace characters is not known in advance. Capture the word in a group instead:

> result_1  <- str_match(myStrings,"MFG\\s*:\\s*(\\w+)")
> result_1[,2]
##[1] "acme"    NA        "initech"

The results you need will be in the second column.

Note that str_extract cannot be used here, since it returns only the whole match and discards captured groups.

A bonus: ICU regex (the flavor stringr uses) does not support infinite-width lookbehind, but it does support constrained-width lookbehind, so quantifiers with an explicit upper bound are allowed. This will also work:

> result_1  <- str_extract(myStrings,"(?<=MFG\\s{0,100}:\\s{0,100})\\w+")
> result_1
[1] "acme"    NA        "initech"

Windhoek answered 4/3, 2016 at 19:23 Comment(3)

I see now, thanks! this is what I need to do if I can't limit the size of the lookbehind. – Pearle 4/3, 2016 at 19:32

Whoa, I didn't know you could quantify in lookarounds with {} in stringr regex; that's exciting. That will fail in base or perl = TRUE regex, though. – Alina 4/3, 2016 at 19:36

That is why it is called a constrained width lookbehind: if the length can be calculated (and with a limiting quantifier with both min and max values it is possible) it can be used. – Theodicy 4/3, 2016 at 19:39

We can use a regex lookbehind with a fixed-width pattern. Note that this version matches only when exactly one whitespace character follows "MFG:".

str_extract(myStrings, "(?<=MFG:\\s)\\w+")
#[1] "acme"    NA        "initech"

Reel answered 4/3, 2016 at 19:22 Comment(3)

Thanks! Yes, my regex had worked in .NET, but R's regex is a bit different! It makes sense now that the \\s* was the problem. – Pearle 4/3, 2016 at 19:28

There is no universal R regex; different modules use different flavors. There are TRE, PCRE and ICU regex flavors. – Theodicy 4/3, 2016 at 19:31

@WiktorStribiżew, unfortunately, I can't know them all. I just have to wing it until something stops working. – Pearle 4/3, 2016 at 19:35

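I wrote the code in Python using a lookbehind. If the parser finds "MFG:", it grabs the next word.

import re

txt = "MFG: acme, something else, MFG: initech"
# Fixed-width lookbehind for "MFG:"; the match itself starts at the whitespace
pattern = r"(?<=MFG:)\s+\w+"
matches = re.findall(pattern, txt)
for match in matches:
    # strip() removes the leading whitespace that \s+ includes in the match
    print(match.strip())

output:

acme
initech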
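
Python's standard re module also requires a fixed-width lookbehind, so the variable amount of whitespace cannot be moved into the lookbehind itself. The following is a minimal sketch, assuming a capture group is acceptable in place of the lookbehind. The name mfg_pattern and the sample text with extra spaces are illustrative and not part of the original answer.

```python
import re

# Sample text (illustrative): note the varying whitespace around the colon
txt = "MFG: acme, something else, MFG  :   initech"

# A variable-width lookbehind is rejected by Python's re module at compile time
try:
    re.compile(r"(?<=MFG\s*:\s*)\w+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Capture group instead: match the whole prefix, keep only the word
mfg_pattern = re.compile(r"MFG\s*:\s*(\w+)")
words = mfg_pattern.findall(txt)  # findall returns the group when there is exactly one

print(lookbehind_ok)
print(words)
```

With the standard re module this should print False followed by ['acme', 'initech']. It is the same idea as the str_match approach above: consume the prefix in the match and read the word from the group.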

Macrospore answered 9/11, 2021 at 23:59 Comment(0)
