Fast and efficient substring extraction and comparison in R

Asked 10/10, 2023 at 9:51 Answered 10/10, 2023 at 11:48

Solved r string substring memory-efficient stringdist

I have a problem concerning very fast and efficient comparison between the substrings of two strings in my dataset, which won't run fast enough despite pretty powerful machinery. I have a data.table with 2 columns and about 1.5 billion rows, which has this structure:

library(data.table)
library(stringr)
library(stringi)
library(stringdist)

dt <- data.frame(c("002134", "024345", "176234"), c("002003", "024234", "002004"))
colnames(dt) <- c("class1", "class2")
setDT(dt)

What I want is a function that (1) extracts the first 3 digits from each string by row for both vectors, (2) compares the substrings between the two vectors, and (3) create a new boolean variable that reports whether the two substrings are equal or not.

So the desired result is as follows:

dt$sameclass <- c(TRUE, TRUE, FALSE)
print(dt)
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

I have tried versions of stringr and stringi both within data.table functionality and without. For comparing the substrings I use stringdist, since to my understanding can be parallelized which would be very beneficial on my server. However, the bottleneck still seems to be the substring extraction.


#stringi + stringdist without data.table: 
dt$redclass1 <- stri_sub(dt$class1, to = 3)
dt$redclass2 <- stri_sub(dt$class2, to = 3)
dt[, classdist := stringdist(a = redclass1, b = redclass2, method = "hamming")] 
dt[, sameclass := (classdist == 0)] 

#stringi + stringdist within data.table: 
dt[, classdist := stringdist(a = stri_sub(dt$class1, to = 3), b = stri_sub(dt$class2, to = 3), method = "hamming")] 
dt[, sameclass := (classdist == 0)] 

#stringr with separate function: 
sameclass <- function(subclass1, subclass2, classdepth){
  truthvalue <- (str_sub(subclass1, end = classdepth) == str_sub(subclass2, end = classdepth))
  return(truthvalue)
} 
dt[, sameclass := sameclass(subclass1 = class1, subclass2 = class2, classdepth = 3), by = seq_len(nrow(dt))]

All versions either run into memory problems or take several hours to a day to run. Since I need to do this repeatedly this does not work for me, and I wanted to ask if you can come up with something faster/ more efficient. Any help would be greatly appreciated!

EDIT

I have benchmarked some of the methods suggested here, which indeed show a substantial speedup:

dt <- data.frame(rep(c("002134", "024345", "176234"), 1000), rep(c("002003", "024234", "002004"), 1000))
colnames(dt) <- c("class1", "class2")
setDT(dt)

times <- microbenchmark(

startswithtest = dt[, startsWith(class2, substring(class1, 1, 3))],
lapplytest = dt[, do.call(`==`, lapply(.SD, substring, 1, 3)), .SDcols = c("class1", "class2")],
numerictest = dt[, as.numeric(class1)%/%1000 == as.numeric(class2)%/%1000],
functiontest = dt[, sameclass(subclass1 = class1, subclass2 = class2, classdepth = 3), by = seq_len(nrow(dt))],
stringitest = dt[, stringdist(a = stri_sub(dt$class1, to = 3), b = stri_sub(dt$class2, to = 3), method = "hamming")], 
times = 50
)
times
           expr       min        lq       mean     median         uq        max neval
 startswithtest   312.501   356.901   593.4530   444.8515    737.301   1692.602    50
     lapplytest   383.602   439.201   736.3512   522.7010    966.901   2259.601    50
    numerictest  1763.100  1932.600  3229.6651  2399.7510   4153.201   8396.301    50
   functiontest 45677.700 61124.002 81567.9409 77844.5510 100084.901 133921.502    50
    stringitest   794.201  1028.200  1423.5289  1259.6005   1739.400   3640.701    50

I will go with the startsWith for now, since it seems to offer the highest speed (unfortunately I was not able to use the C-function due to restrictions on my server). Thank you for the help!

Tyburn answered 10/10, 2023 at 9:51 Comment(0)

Probably you can try

> dt[, sameclass := do.call(`==`, lapply(.SD, substring, 1, 3))][]
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

or maybe faster

> dt[, sameclass := startsWith(class2, substring(class1, 1, 3))][]
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

Pettway answered 10/10, 2023 at 9:57 Comment(0)

Since we're talking speed, a benchmark might be handy. I compared one of @Nils R's approaches, both @ThomasIsCoding's and @Onyambu's and a slightly optimized version of @Onyambu's C++ function (which basically reimplements Thomas' startsWith approach):

Rcpp::cppFunction("
std::vector<bool> compare2(std::vector<std::string> x, std::vector<std::string> y){
    std::vector<bool> b;
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < 3; ++j)
            if (x[i][j] != y[i][j]) {
                b.push_back(false);
                break;
            } else if (j == 2)
                b.push_back(true);
    }
    return b;
}")

Data and benchmark setup:

n <- 100000
dt <- data.frame(rep(c("002134", "024345", "176234"), n), 
                 rep(c("002003", "024234", "002004"), n))
colnames(dt) <- c("class1", "class2")
setDT(dt)
dt1 <- copy(dt)

microbenchmark::microbenchmark(
  nils.stringdist = {
    dt[, classdist := stringdist(a = stri_sub(class1, to = 3), 
                                 b = stri_sub(class2, to = 3), method = "hamming")] 
    dt[, sameclass := (classdist == 0)]},
  thomas.eq = {
    dt1[, sameclass := do.call(`==`, lapply(.SD, substring, 1, 3))]
    dt1[, sameclass := NULL]},
  thomas.startsWith = dt[, sameclass := startsWith(class2, substring(class1, 1, 3))],
  onyambu.cpp = dt[, sameclass := compare(class1, class2)],
  onyambu.numeric = dt[, sameclass := 
                         as.numeric(class1) %/% 1000 == as.numeric(class2) %/% 1000],
  cpp2 = dt[, sameclass := compare2(class1, class2)]
)

And the results:

Unit: milliseconds
              expr      min        lq      mean    median        uq      max neval cld
   nils.stringdist  66.3890  70.08755  80.65836  75.80350  86.11460 133.0720   100  b 
         thomas.eq  35.2667  37.75490  43.33864  39.82885  44.83400  92.9219   100 a  
 thomas.startsWith  24.1990  26.61705  34.60704  29.93990  34.12790 124.3365   100 a  
       onyambu.cpp  30.5307  32.63700  36.83997  34.12350  37.55005 156.4901   100 a  
   onyambu.numeric 271.1953 297.88525 324.14725 310.84680 331.45795 806.8805   100   c
              cpp2  26.8051  28.00815  35.86227  29.96770  32.19430 428.3867   100 a

It appears that custom C++ functions are not significantly more efficient as there are no performance bottlenecks related to the usage of substring/startsWith. The startsWith approach was systematically faster across simulations but not dramatically so due to the short strings' length.

Bimetallism answered 10/10, 2023 at 11:48 Comment(1)

Thank you for that, I was just editing my post so I didn't see your answer, but this is much more helpful since it includes the C-version! +1 from me – Tyburn 10/10, 2023 at 11:54

In case you are still looking for speed consider using a c++ for-loop

Rcpp::cppFunction("
std::vector<bool> compare(std::vector<std::string> x, std::vector<std::string> y){
   std::vector<bool> b;
   for (std::size_t i = 0; i < x.size();i++)
       b.push_back(x[i].substr(0,3) == y[i].substr(0,3));
 return b;
}")
dt[, sameclass:=compare(class1, class2)][]
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

Edit

As your strings are digits, you could do integer division:

dt[, sameclass:=as.numeric(class1)%/%1000 == as.numeric(class2)%/%1000][]
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

Spinks answered 10/10, 2023 at 10:19 Comment(0)

Probably you can try

> dt[, sameclass := do.call(`==`, lapply(.SD, substring, 1, 3))][]
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

or maybe faster

> dt[, sameclass := startsWith(class2, substring(class1, 1, 3))][]
   class1 class2 sameclass
1: 002134 002003      TRUE
2: 024345 024234      TRUE
3: 176234 002004     FALSE

Pettway answered 10/10, 2023 at 9:57 Comment(0)

Edit

Recommended topics

Hot tags