Subset string by counting specific characters

Asked 27/12, 2018 at 19:53 Answered 27/12, 2018 at 23:41

I have the following strings:

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:

some_function(strings)

c("ABBSDGN", "AABSDG", "AGN", "GGG")

I tried using stringi, stringr and regular expressions, but I can't figure it out.

Cram answered 27/12, 2018 at 19:53 Comment(0)

You can accomplish your task with a simple call to str_extract from the stringr package:

library(stringr)

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

The [^AGN]*[AGN] portion of the regex pattern matches zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. Wrapping it in parentheses and adding braces, as in ([^AGN]*[AGN]){3}, matches that pattern three times in a row. You can change the number of occurrences of A, G, or N that you are looking for by changing the integer in the curly braces; a string with fewer occurrences than that yields NA:

str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
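
If the characters or the count vary, the pattern can be assembled from them instead of being typed by hand. A minimal sketch, assuming the characters are plain letters with no special meaning inside a regex bracket expression (make_pattern is a hypothetical helper name, not part of stringr):

```r
# Hypothetical helper: build the pattern from a set of characters and a count.
# Assumes `chars` holds plain letters (no "^", "]", "-" or "\").
make_pattern <- function(chars, n) {
  cls <- paste(chars, collapse = "")
  sprintf("([^%s]*[%s]){%d}", cls, cls, as.integer(n))
}

make_pattern(c("A", "G", "N"), 3)
# [1] "([^AGN]*[AGN]){3}"
```

The result can be passed straight to str_extract, for example str_extract(strings, make_pattern(c("A", "G", "N"), 4)).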

There are a couple of ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:

m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Alternatively, you can use sub:

sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
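
One thing to watch for: the approaches behave differently when a string has fewer matches than requested. str_extract returns NA (as in the {4} example above), regmatches drops the element, and sub returns the input unchanged. A small sketch of the difference, plus one possible way to keep NA placeholders in base R (the variable names are only illustrative):

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

m <- regexpr('([^AGN]*[AGN]){4}', strings)

# regmatches drops the non-matching second string, so the result has length 3
dropped <- regmatches(strings, m)
dropped
# [1] "ABBSDGNHN"  "AGNA"       "GGGDSRTYHG"

# sub leaves the non-matching second string untouched
unchanged <- sub('(([^AGN]*[AGN]){4}).*', '\\1', strings)
unchanged
# [1] "ABBSDGNHN"  "AABSDGDRY"  "AGNA"       "GGGDSRTYHG"

# Keep one result per input, with NA where there was no match
out <- rep(NA_character_, length(strings))
out[m != -1] <- regmatches(strings, m)
out
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```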

Veracruz answered 27/12, 2018 at 23:41 Comment(1)

I don't think it can get much better than the one-liner str_extract(strings, '([^AGN]*[AGN]){3}'). Nice one! – Schoenfelder 28/12, 2018 at 3:3

Here is a base R option using strsplit:

sapply(strsplit(strings, ""), function(x)
    paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Or in the tidyverse:

library(tidyverse)
map_chr(str_split(strings, ""), 
    ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
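
A caveat with which.max: if the count never reaches 3, the comparison is FALSE everywhere and which.max returns 1, so the string is silently cut to its first character. A possible variant, sketched here with match so that such strings give NA instead (cut_at_count and its arguments are hypothetical names):

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# Hypothetical helper: cut each string at the n-th occurrence of any of `chars`.
cut_at_count <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    # Position where the running count first equals n; NA if never reached
    idx <- match(n, cumsum(ch %in% chars))
    if (is.na(idx)) NA_character_ else paste(ch[seq_len(idx)], collapse = "")
  }, character(1))
}

cut_at_count(strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_at_count(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```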

Curler answered 27/12, 2018 at 20:16 Comment(0)

Identify the positions of the pattern using gregexpr, then take the n-th position (here the 3rd) and extract everything from position 1 up to it using substr.

nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))

PS:

If a string has fewer than 3 matches, the result for it will be NA, so you just need to call na.omit on the final result to drop those.
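
To illustrate the NA case, here is a sketch that asks for the 4th match, which the second string does not have. It also pulls the n-th position out with vapply so that substr can be called once on the whole vector; that is just one way to write it in base R (pos and res are illustrative names):

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
nChars <- 4
pattern <- "A|G|N"

# n-th match position per string; NA when a string has fewer than nChars matches
pos <- vapply(gregexpr(pattern, strings), `[`, integer(1), nChars)

# substr is vectorized over both the strings and the stop positions
res <- substr(strings, 1, pos)
res
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"

# Drop the strings that did not have enough matches
as.vector(na.omit(res))
# [1] "ABBSDGNHN"  "AGNA"       "GGGDSRTYHG"
```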

Racism answered 27/12, 2018 at 20:19 Comment(1)

Nice! substr is vectorized, so I would simplify your last line like this: substr(strings, 1, map_int(gregexpr(pattern, strings), nChars)), where map_int from purrr is used. – Veracruz 27/12, 2018 at 22:15

This is just a version of Maurits Evers' neat solution without strsplit.

sapply(strings,
       function(x) {
         raw <- rawToChar(charToRaw(x), multiple = TRUE)
         idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
         paste(raw[1:idx], collapse = "")
       })
## ABBSDGNHNGA   AABSDGDRY      AGNAFG  GGGDSRTYHG 
##   "ABBSDGN"    "AABSDG"       "AGN"       "GGG"

Or, slightly different, without strsplit and paste:

test <- charToRaw("AGN")
sapply(strings,
       function(x) {
         raw <- charToRaw(x)
         idx <- which.max(cumsum(raw %in% test) == 3)
         rawToChar(raw[1:idx])
       })

Dogma answered 27/12, 2018 at 21:21 Comment(0)

Interesting problem. I wrote a function (see below) that solves it. It assumes that your strings contain only letters; in particular, they must not contain "!", which the function uses as a placeholder.

reduce_strings = function(str, chars, cnt){

  # Replacing chars in str with "!"
  chars = paste0(chars, collapse = "")
  replacement = paste0(rep("!", nchar(chars)), collapse = "")
  str_alias = chartr(chars, replacement, str) 

  # Obtain indices with ! for each string
  idx = stringr::str_locate_all(pattern = '!', str_alias)

  # Reduce each string in str
  reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
  result = vapply(seq_along(str), reduce, "character")
  return(result)
}

# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG") 
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

Scilla answered 27/12, 2018 at 20:48 Comment(0)
