R-regex: match strings not beginning with a pattern

Asked 8/12, 2011 at 21:49 Answered 13/1, 2020 at 11:0

I'd like to use a regex to check whether a string does not begin with a certain pattern. While I can use a negated character class, [^...], to blacklist certain characters, I can't figure out how to blacklist a pattern.

> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE

I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. to match if the string does not start with the sequence abc.

Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:

> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE

Any ideas?

Doable answered 8/12, 2011 at 21:49 Comment(2)

Could you match strings that do begin with the pattern, then negate the logical result from grepl? – Upswing 8/12, 2011 at 21:53

Sure, but I'd like to put some more stuff in there! =) – Doable 8/12, 2011 at 21:57

Yeah. Put the zero-width lookahead /outside/ the other parens, so it is tested only once, right after ^. That should give you this:

> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE

which I think is what you want.

Alternatively, if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are equivalent and both acceptable.
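To make the difference between the two placements concrete, here is a small sketch run over a character vector. The vector x and the extra "must end in a digit" condition are made up for illustration and are not part of the question; the last pattern only suggests how "more stuff" might be added after the lookahead.

```r
# Hypothetical test strings (not from the original question)
x <- c("hede", "foohede", "hedefoo", "foo1", "hede1")

# Lookahead repeated at every position: rejects any string that
# contains "hede" anywhere
contains_free <- grepl("^((?!hede).)*$", x, perl = TRUE)

# Lookahead checked once, right after ^: rejects only strings that
# start with "hede"
no_prefix <- grepl("^(?!hede).*$", x, perl = TRUE)

# Further conditions can follow the lookahead in the same pattern,
# e.g. "does not start with hede AND ends in a digit"
no_prefix_digit <- grepl("^(?!hede).*[0-9]$", x, perl = TRUE)

print(data.frame(x, contains_free, no_prefix, no_prefix_digit))
```

Only "foo1" passes all three checks, while "foohede" passes only the second one.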

Channa answered 8/12, 2011 at 21:53 Comment(0)

There is now (years later) another possibility with the stringr package.

library(stringr)

str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE

str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE

Created on 2020-01-13 by the reprex package (v0.3.0)
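For comparison, a base R sketch that should give the same results without loading stringr, along the lines of the comment under the question. The names by_regex and by_prefix are chosen here for illustration only.

```r
x <- c("dsadsf", "abcff")

# Negate an ordinary "starts with" regex match
by_regex <- !grepl("^abc", x)

# startsWith() treats the prefix as a literal string, not a regex
by_prefix <- !startsWith(x, "abc")

print(by_regex)
print(by_prefix)
```

Note that startsWith() does no regex matching at all, so it only fits when the prefix is a fixed string.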

Lillis answered 13/1, 2020 at 11:0 Comment(0)

I got stuck on the following special case, so I thought I would share...

What if there are multiple instances of the regular expression, but you still only want the first segment?

Apparently you can turn off the default greediness of the match with Perl-style lazy quantifiers.

Suppose the string I wanted to process was

myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
                           LETTERS[1:13], "_", LETTERS[14:26], "__",
                           "laksjdl", "_", "lakdjlfalsjdf"),
                         collapse = "")
myExampleString

"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjdf"

and that I wanted only the first segment before the first "__". I cannot simply search on "_", because single-underscore is an allowable non-delimiter in this example string.

The following doesn't work. Because ".+" is greedy by default, it runs up to the last "__" and gives me the first and second segments (but not the third, because the lookahead requires a "__" to follow).

gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)

"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"

But this does work

gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)

"abcdefghijklm_nopqrstuvwxyz"

The difference is the lazy (non-greedy) modifier "?" after the ".+" in the (Perl) regular expression.
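As a cross-check, here is a sketch of two other ways to get the first segment without a lookahead: deleting everything from the first "__" onward, or splitting on the literal delimiter. The short string s is a stand-in for the example above, and the variable names are illustrative.

```r
# Shortened stand-in for the example string
s <- "ab_cd__EF_GH__ij_kl"

# Lazy quantifier, as in the answer above
lazy <- gsub("^(.+?(?=__)).*$", "\\1", s, perl = TRUE)

# Delete from the first "__" to the end; the regex engine finds the
# leftmost match, so no lazy quantifier is needed
trimmed <- sub("__.*$", "", s)

# Split on the literal delimiter and keep the first piece
first_piece <- strsplit(s, "__", fixed = TRUE)[[1]][1]

print(c(lazy, trimmed, first_piece))
```

All three should agree on this input. The strsplit() version with fixed = TRUE may be the easiest to read when the delimiter is a plain string.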

Publishing answered 8/4, 2016 at 21:4 Comment(0)
