Overlapping matches in R

M

6

14

I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.

I also found the following SO question speaking of finding indexes to perform this task, but was not able to find anything concise about grabbing overlapping matches in the R language.

I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapping matches.

But when I do this in R the same way I would in other languages, using perl=T, it yields no results.

> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""

The same goes for both the stringi and stringr packages.

> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""

The correct results that should be returned when executing this are:

[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Edit

I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches and why are no results returned? I am scavenging for a somewhat detailed answer.
Are the stringi and stringr packages any better at this than regmatches?
Please feel free to add to my answer or come up with a different workaround than I have found.

Markson answered 12/9, 2014 at 2:56 Comment(0)

P

7

The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- replacement function that may illustrate this. Observe:

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
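
To see where those zero-length matches come from, it may help to look at the attributes that gregexpr attaches to its result when perl=TRUE. The following is a minimal sketch using only base R; it only inspects the match object and does not extract anything.

```r
# Sketch: inspect what gregexpr() records for the lookahead pattern.
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

as.integer(m)              # start position of each match
attr(m, "match.length")    # length of the whole match: zero for every lookahead match
attr(m, "capture.start")   # one-column matrix: start of capture group 1
attr(m, "capture.length")  # one-column matrix: length of capture group 1
```

regmatches only consults the start positions and match.length, which is why it returns empty strings; the captured text is described by the capture.* attributes instead.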

I've created a regcapturedmatches() function that I often use for such tasks. For example

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

gregexpr is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.

Paternal answered 12/9, 2014 at 3:37 Comment(3)

+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi and stringr are not meant to handle this? – Markson 12/9, 2014 at 3:45

I can't speak to stringr as I've never used that myself, but regmatches really focuses on the match rather than the capture (which are highly related but slightly different). I've added an additional sample to try to make it clear what regmatches() is capturing compared to my function. – Paternal 12/9, 2014 at 3:50

Yeah, I've used regmatches()<- like that beforehand to observe the effect of the zero-width matches. – Markson 12/9, 2014 at 3:53

M

7

As for a workaround, this is what I have come up with to extract the overlapping matches.

> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Please feel free to add or comment on a better way to perform this task.

Markson answered 12/9, 2014 at 2:56 Comment(2)

The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this: – Creamy 10/8, 2015 at 14:46

Oops. I forgot I can't put code blocks in comments. Will make this a separate answer. – Creamy 10/8, 2015 at 14:48

C

5

A stringi solution using a capture group in the look-ahead part:

> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Candlefish answered 26/10, 2014 at 19:55 Comment(5)

Weird, how come it failed to work with stri_extract_all_regex – Markson 26/10, 2014 at 20:0

@hwnd: it's a 0-length match; (?=...) does not advance the input position. – Candlefish 26/10, 2014 at 20:2

Yes I know it's a zero-width match =) I guess there is a difference between extract_all_regex and match_all_regex – Markson 26/10, 2014 at 20:4

No, the 1st column of the resulting matrix (the whole match) consists only of empty strings :) – Candlefish 26/10, 2014 at 20:5

Ok now I see and understand what you mean. – Markson 26/10, 2014 at 20:6
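
The point made in the comments above can be seen by printing both columns of the matrix that stri_match_all_regex returns. This is a small sketch and assumes the stringi package is installed.

```r
library(stringi)

# Column 1 holds the whole match, column 2 holds capture group 1.
res <- stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]]

res[, 1]  # whole (zero-length) matches: empty strings
res[, 2]  # captured text from inside the lookahead
```

stri_extract_all_regex only ever returns the whole match, so it reports what is in the first column here.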

B

4

Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length" with the "capture.length":

x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
       attr(i,"match.length") <- attr(i,"capture.length")
       i
     })
regmatches(x,m)

#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Blende answered 12/9, 2014 at 5:10 Comment(1)

+1 Thanks for the additional solution. I've done similar using capture.start and capture.length. – Markson 12/9, 2014 at 5:28

B

4

It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA elements.

x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Bareilly answered 13/9, 2014 at 2:54 Comment(0)

C

1

An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:

> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Pretty ugly, which is why the stringr etc. packages exist.
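
To illustrate the variable-length case, here is a rough sketch of the same idea using the vectorised substring() instead of sapply(). The input string and the pattern '(?=(A+C))' are made up for this example so that the captured regions differ in length.

```r
# Hypothetical input and pattern, chosen so the captures differ in length.
x <- 'AAACAC'
m <- gregexpr('(?=(A+C))', x, perl = TRUE)[[1]]

# Column 1 of each matrix corresponds to the first (and only) capture group.
start <- attr(m, "capture.start")[, 1]
len   <- attr(m, "capture.length")[, 1]

# substring() is vectorised over its start and stop arguments.
res <- substring(x, start, start + len - 1)
res
```

With more than one capture group, the other columns of capture.start and capture.length would need to be handled in the same way.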

Creamy answered 10/8, 2015 at 14:51 Comment(0)
