Split a string into consecutive substrings of size n in R in an efficient way
# Input
n <- 2
"abcd" 
# Output
c("ab", "bc", "cd")

I don't want to use a for loop or sapply.

Lenorelenox answered 21/9, 2023 at 8:27 Comment(1)
You may use substring:

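get_n_grams <- function(string, n) {
  len <- nchar(string)
  # No n-grams exist when the string is shorter than n
  if (len < n) return(character(0))
  # Start positions 1..(len - n + 1), matching end positions n..len
  substring(string, seq_len(len - n + 1), n:len)
}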

get_n_grams("abcd", 2)
#[1] "ab" "bc" "cd"

get_n_grams("abcd", 3)
#[1] "abc" "bcd"
Crampton answered 21/9, 2023 at 8:32 Comment(2)
Is it possible to vectorise the function with respect to string? – Lenorelenox
Using the current answer, I don't think so. Your best bet would be sapply(string, get_n_grams, 2). – Crampton
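
On the vectorisation question in the comments: one possible sketch that avoids an explicit loop or sapply is to build all start positions at once with sequence() and make a single substring() call, then split the result back per input string. The name get_n_grams_vec is hypothetical (it is not part of the answer above), and the sketch assumes a character vector without NA values.

```r
# Hypothetical helper: n-grams for every element of a character vector.
# Assumes `strings` contains no NA values.
get_n_grams_vec <- function(strings, n) {
  # Number of n-grams per string (0 when a string is shorter than n)
  counts <- pmax(nchar(strings) - n + 1L, 0L)
  # Index of the source string for each n-gram
  ids <- rep(seq_along(strings), counts)
  # Start positions 1..count, restarted for each string
  starts <- sequence(counts)
  # One vectorised substring() call over all (string, start) pairs
  grams <- substring(strings[ids], starts, starts + n - 1L)
  # Regroup by source string; strings with no n-grams give character(0)
  unname(split(grams, factor(ids, levels = seq_along(strings))))
}

res <- get_n_grams_vec(c("abcd", "xyz", "a"), 2)
```

Here res is a list with one character vector per input string, so the third element is empty because "a" is shorter than n.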
This embed trick could work, but it might not be as efficient as the substring approach by @Ronak Shah:

> n <- 2
> s <- "abcd"
> apply(embed(utf8ToInt(s), n)[, n:1], 1, intToUtf8)
[1] "ab" "bc" "cd"
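
To check the efficiency remark rather than guess, a rough timing comparison along these lines could be run. This is only a sketch: the wrapper names via_substring and via_embed and the random test string are made up for illustration, and the timings will vary by machine, so none are shown.

```r
set.seed(1)
n <- 2
# Random lowercase test string of 100,000 characters (illustrative input)
s <- paste(sample(letters, 1e5, replace = TRUE), collapse = "")

# substring approach: vectorised over start and end positions
via_substring <- function(s, n) {
  len <- nchar(s)
  substring(s, seq_len(len - n + 1), n:len)
}

# embed approach: code points -> lagged matrix -> back to strings, row by row
via_embed <- function(s, n) {
  apply(embed(utf8ToInt(s), n)[, n:1], 1, intToUtf8)
}

# Both approaches should agree on the result before timing them
same <- identical(via_substring(s, n), via_embed(s, n))

t_substring <- system.time(via_substring(s, n))
t_embed <- system.time(via_embed(s, n))
```

Comparing the "elapsed" entries of t_substring and t_embed gives a first impression; for n = 1 the embed version would need drop = FALSE in the column indexing to keep a matrix.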
Drogin answered 21/9, 2023 at 8:39 Comment(3)
embed has a for loop inside – Lenorelenox
OMG, you are doing a TheDailyWTF here: converting what might or might not be UTF-8 there and back again! – Latifundium
@CarlWitthoft just for fun, not for serious – Drogin
