Extract part of string between two different patterns

I am trying to use the stringr package to extract the part of a string that lies between two particular patterns.

For example, I have:

my.string <- "nanaqwertybaba"
left.border  <- "nana"
right.border <- "baba"

Using the str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression), I would like to get:

"qwerty"

Solutions from Google did not work.

Fredia answered 7/4, 2014 at 22:21 Comment(0)

I do not know whether or how this can be done with the functions provided by stringr, but you can use base R's regexpr and substring:

pattern <- paste0("(?<=", left.border, ")[a-z]+(?=", right.border, ")")
# "(?<=nana)[a-z]+(?=baba)"

rx <- regexpr(pattern, text=my.string, perl=TRUE)
# [1] 5
# attr(,"match.length")
# [1] 6

substring(my.string, rx, rx+attr(rx, "match.length")-1)
# [1] "qwerty"
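
The same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: extract_between is a hypothetical name (it is not part of base R or stringr), it returns the first match per string, it matches any characters lazily instead of [a-z]+, and it quotes the borders with \Q...\E so that regex metacharacters in them are taken literally (which breaks if a border itself contains \E).

```r
# Hypothetical helper, not part of base R or stringr.
extract_between <- function(x, left, right) {
  # \\Q...\\E makes PCRE treat the borders as literal text.
  # .*? is lazy, so the match stops at the first occurrence of `right`.
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  m <- regexpr(pattern, x, perl = TRUE)

  # regexpr() returns -1 where nothing matched; report those as NA.
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

extract_between("nanaqwertybaba", "nana", "baba")
# [1] "qwerty"
extract_between("a.b[value]c", "[", "]")
# [1] "value"
```

Returning NA for strings without a match keeps the result the same length as the input, which is convenient when the function is applied to a whole column.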
Phototypy answered 7/4, 2014 at 22:43 Comment(1)
Thank you, sigbb! I have just adjusted it a little so that it (1) matches all characters between left.border and right.border and (2) matches only up to the first occurrence of right.border. Now I have: rx <- regexpr(paste0("(?<=", left.border, ")(.*?)(?=", right.border, ")"), text = my.string, perl = TRUE). Big thank you to you! Fredia

In base R you can use gsub. The parentheses in the pattern create numbered capturing groups. Here we select the second group in the replacement, i.e. the group between the borders. The . matches any character, and the * means zero or more of the preceding element.

gsub(pattern = "(.*nana)(.*)(baba.*)",
     replacement = "\\2",
     x = "xxxnanaRisnicebabayyy")
# "Risnice"
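
Two behaviours of this approach are worth checking before relying on it. The sketch below is illustrative only: the test string is made up, and perl = TRUE is added here (the call above uses the default engine) so that the greedy and lazy quantifiers behave as in Perl-style regular expressions.

```r
# Made-up input with two bordered segments.
s <- "xxxnanaAbabaBnanaCbabayyy"

# Greedy .* : the first group runs up to the LAST "nana".
greedy <- sub("(.*nana)(.*)(baba.*)", "\\2", s, perl = TRUE)
greedy
# [1] "C"

# Lazy .*? : stops at the FIRST "nana" and the first "baba" after it.
lazy <- sub("^.*?nana(.*?)baba.*$", "\\1", s, perl = TRUE)
lazy
# [1] "A"

# If the pattern does not match, the input comes back unchanged,
# which can be mistaken for a successful extraction.
unchanged <- sub("(.*nana)(.*)(baba.*)", "\\2", "no borders here", perl = TRUE)
unchanged
# [1] "no borders here"
```

Because the pattern consumes the whole string, sub and gsub give the same result here; a separate grepl check is one way to detect the no-match case.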
Crystallize answered 7/4, 2014 at 22:46 Comment(2)
Well, the point is that I do not know that "qwerty" sits there, so there is no way I can use it in the regex pattern! Fredia
@Marciszka: you can replace "qwerty" in this example with a regular expression as well, e.g. gsub(pattern = "(.*nana)([[:alpha:]]+)(baba.*)", "\\2", x=my.string) for at least one letter. Phototypy

I would use str_match from stringr. From its documentation: "str_match extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group."

str_match(my.string, paste(left.border, '(.+)', right.border, sep=''))[,2]

The code above builds the regular expression with paste, placing the capture group (.+), which captures one or more characters, between the left and right borders (sep='' means nothing is inserted between the strings).

A single match is assumed, so [,2] selects the second column of the matrix returned by str_match, which holds the first capture group.
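
Since the question asks about str_extract specifically, a lookaround pattern is one way to get there. This is a minimal sketch, assuming the borders contain no regex metacharacters (they are pasted into the pattern unescaped) and using a lazy .+? so the match ends at the first right border:

```r
library(stringr)

my.string    <- "nanaqwertybaba"
left.border  <- "nana"
right.border <- "baba"

# (?<=...) requires the left border just before the match,
# (?=...) requires the right border just after it;
# neither border becomes part of the extracted text.
pattern <- paste0("(?<=", left.border, ").+?(?=", right.border, ")")

res <- str_extract(my.string, pattern)
res
# [1] "qwerty"
```

Unlike str_match, this returns a plain character vector, so no column indexing is needed.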

Viborg answered 11/2, 2015 at 9:52 Comment(0)

You can use the package unglue:

library(unglue)
my.string <- "nanaqwertybaba"
unglue_vec(my.string, "nana{res}baba")
#> [1] "qwerty"
Zoology answered 8/10, 2019 at 21:6 Comment(0)
