How to create a list of overlapping segments of a string in R?

Asked 24/8, 2021 at 7:41 Answered 24/8, 2021 at 7:57

For a string like 'ABCDEFG', is it possible to split it into lists of overlapping segments of different lengths? For example, with 2 letters: 'AB', 'BC', 'CD', 'DE', 'EF', 'FG'. With 3 letters: 'ABC', 'BCD', 'CDE', 'DEF', 'EFG'. And so on. Each segment should be shifted by just one letter from the previous one, rather than the string simply being cut into non-overlapping pieces.

Thank you very much.

Worth answered 24/8, 2021 at 7:41 Comment(2)

consecutive ≠ overlapping ;-) – Unruh 24/8, 2021 at 7:45

Oh sorry. Thanks for correcting. – Worth 24/8, 2021 at 7:47

I'm no expert, and I'm not sure this is what you were looking for, but I think the stringr package can do the trick.

string <- "ABCDEF"
library(stringr)

combinated_letters <- function(string, n) {
  length_ <- str_length(string)
  str_sub(string, seq(1, length_ + 1 - n), seq(n, length_))
}

combinated_letters(string, 1)
combinated_letters(string, 2)
combinated_letters(string, 3)
combinated_letters(string, 4)
combinated_letters(string, 5)
combinated_letters(string, 6)

With the result:

> combinated_letters(string, 1)
[1] "A" "B" "C" "D" "E" "F"
> combinated_letters(string, 2)
[1] "AB" "BC" "CD" "DE" "EF"
> combinated_letters(string, 3)
[1] "ABC" "BCD" "CDE" "DEF"
> combinated_letters(string, 4)
[1] "ABCD" "BCDE" "CDEF"
> combinated_letters(string, 5)
[1] "ABCDE" "BCDEF"
> combinated_letters(string, 6)
[1] "ABCDEF"
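
One edge case worth keeping in mind: when n is larger than the number of characters, seq() counts downwards instead of returning an empty sequence, so the function above still produces output. A possible guarded variant is sketched below; overlapping_segments is a hypothetical name, and returning an empty vector for an out-of-range n is an assumption about the desired behaviour.

```r
library(stringr)

# Hypothetical guarded variant of combinated_letters()
overlapping_segments <- function(string, n) {
  len <- str_length(string)
  # Assumption: an out-of-range n should yield no segments at all
  if (n < 1 || n > len) return(character(0))
  # seq_len() never counts downwards, unlike seq(1, k) with k < 1
  start <- seq_len(len - n + 1)
  str_sub(string, start, start + n - 1)
}

overlapping_segments("ABCDEF", 3)
# [1] "ABC" "BCD" "CDE" "DEF"
overlapping_segments("ABCDEF", 7)
# character(0)
```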

Vitascope answered 24/8, 2021 at 7:49 Comment(0)

There’s no builtin way, unfortunately. That said, doing this manually is fairly straightforward.

Given:

x = 'ABCDEFG'
len = 3L

start = seq_len(nchar(x) - len + 1L)
result = vapply(start, \(s) substr(x, s, s + len - 1L), character(1L))

Or, wrapped in a function (as mentioned, these overlapping substrings are called “ngrams”):

ngrams = function (x, len) {
  start = seq_len(nchar(x) - len + 1L)
  vapply(start, \(s) substr(x, s, s + len - 1L), character(1L))
}
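
To get the segments for several lengths at once, as the question asks, the function can be mapped over a vector of lengths. This is a minimal sketch that restates ngrams() with function(s) instead of \(s) so that it also runs on R versions older than 4.1; lens and all_ngrams are illustrative names only.

```r
ngrams <- function(x, len) {
  start <- seq_len(nchar(x) - len + 1L)
  vapply(start, function(s) substr(x, s, s + len - 1L), character(1L))
}

x <- "ABCDEFG"
lens <- 2:3

# One character vector per requested length, named by that length
all_ngrams <- setNames(lapply(lens, function(len) ngrams(x, len)), lens)

all_ngrams[["2"]]
# [1] "AB" "BC" "CD" "DE" "EF" "FG"
all_ngrams[["3"]]
# [1] "ABC" "BCD" "CDE" "DEF" "EFG"
```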

Alternatively you can use substring() instead of substr() + vapply(), because substring() is vectorised:

ngrams = function (x, len) {
  start = seq_len(nchar(x) - len + 1L)
  substring(x, start, start + len - 1L)
}

However, substring() silently recycles its arguments to a common length, so it is somewhat error-prone when the input isn’t what was expected.
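
To make the recycling issue concrete, here is a small sketch. The first part shows substring() accepting start and end vectors of different lengths without any warning; the second part is one possible way to guard against unexpected input. ngrams_checked is a hypothetical name, and the specific checks (a single non-missing string, a single length of at least 1, an empty result when the string is too short) are assumptions about what counts as valid input.

```r
# Mismatched start/end vectors are recycled without any warning:
starts <- 1:3
ends <- 2:3  # one element short (a deliberate mistake)
recycled <- substring("ABCDEFG", starts, ends)
recycled
# [1] "AB" "BC" ""

# Hypothetical checked variant: validate the input before calling substring()
ngrams_checked <- function(x, len) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x),
            length(len) == 1L, len >= 1L)
  n_start <- nchar(x) - len + 1L
  # Assumption: a string shorter than len simply has no n-grams
  if (n_start < 1L) return(character(0L))
  start <- seq_len(n_start)
  substring(x, start, start + len - 1L)
}

ngrams_checked("ABCDEFG", 3L)
# [1] "ABC" "BCD" "CDE" "DEF" "EFG"
```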

Unruh answered 24/8, 2021 at 7:50 Comment(0)

Yes, these are called n-grams; in this case, character n-grams, where n is the number of characters in each segment.

You can use existing functions to extract those very efficiently:

With `stringdist`:

stringdist::qgrams("ABCDEFG", q = 2)

#    AB BC CD DE EF FG
# V1  1  1  1  1  1  1

This will return a table of counts for each character bigram/n-gram (use a different value for q).
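
Note that a table of counts is not the same thing as the ordered sequence of segments: repeated n-grams are merged into a single entry, so the sequence cannot be rebuilt from the counts alone. The sketch below illustrates the difference using only base R; table() is used here as a stand-in for count-style output and is not the stringdist implementation.

```r
x <- "ABABAB"
starts <- seq_len(nchar(x) - 1L)

# Ordered sequence of overlapping bigrams, repeats included
bigrams <- substring(x, starts, starts + 1L)
bigrams
# [1] "AB" "BA" "AB" "BA" "AB"

# Count-style summary: one entry per distinct bigram
counts <- table(bigrams)
names(counts)
# [1] "AB" "BA"
as.vector(counts)
# [1] 3 2
```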

With `quanteda`:

library(quanteda)

"ABCDEFG" %>% 
  tokens("character") %>% 
  unlist() %>% 
  char_ngrams(2, concatenator = "")

# [1] "AB" "BC" "CD" "DE" "EF" "FG"

This will return the bigrams as a character vector (change the value of n for other n-grams). You can optionally set remove_punct (to remove all punctuation) or remove_symbols in quanteda::tokens() if you need some preprocessing.

Floriated answered 24/8, 2021 at 7:57 Comment(3)

Those 2 options are way slower than the one from @Guillaume Mulier – Seismo 21/9, 2023 at 12:6

An even faster option is found here https://mcmap.net/q/1924921/-split-a-string-in-consecutive-substrings-of-size-n-in-r-in-an-efficient-way – Seismo 21/9, 2023 at 12:6

The proposed `stringdist` solution is not correct – Seismo 21/9, 2023 at 12:9
