I have a problem concerning very fast and efficient comparison between the substrings of two strings in my dataset, which won't run fast enough despite pretty powerful machinery.
I have a data.table
with 2 columns and about 1.5 billion rows, which has this structure:
library(data.table)
library(stringr)
library(stringi)
library(stringdist)
dt <- data.frame(c("002134", "024345", "176234"), c("002003", "024234", "002004"))
colnames(dt) <- c("class1", "class2")
setDT(dt)
What I want is a function that (1) extracts the first 3 digits from each string by row for both vectors, (2) compares the substrings between the two vectors, and (3) create a new boolean variable that reports whether the two substrings are equal or not.
So the desired result is as follows:
dt$sameclass <- c(TRUE, TRUE, FALSE)
print(dt)
class1 class2 sameclass
1: 002134 002003 TRUE
2: 024345 024234 TRUE
3: 176234 002004 FALSE
I have tried versions of stringr
and stringi
both within data.table
functionality and without. For comparing the substrings I use stringdist
, since to my understanding can be parallelized which would be very beneficial on my server. However, the bottleneck still seems to be the substring extraction.
#stringi + stringdist without data.table:
dt$redclass1 <- stri_sub(dt$class1, to = 3)
dt$redclass2 <- stri_sub(dt$class2, to = 3)
dt[, classdist := stringdist(a = redclass1, b = redclass2, method = "hamming")]
dt[, sameclass := (classdist == 0)]
#stringi + stringdist within data.table:
dt[, classdist := stringdist(a = stri_sub(dt$class1, to = 3), b = stri_sub(dt$class2, to = 3), method = "hamming")]
dt[, sameclass := (classdist == 0)]
#stringr with separate function:
sameclass <- function(subclass1, subclass2, classdepth){
truthvalue <- (str_sub(subclass1, end = classdepth) == str_sub(subclass2, end = classdepth))
return(truthvalue)
}
dt[, sameclass := sameclass(subclass1 = class1, subclass2 = class2, classdepth = 3), by = seq_len(nrow(dt))]
All versions either run into memory problems or take several hours to a day to run. Since I need to do this repeatedly this does not work for me, and I wanted to ask if you can come up with something faster/ more efficient. Any help would be greatly appreciated!
EDIT
I have benchmarked some of the methods suggested here, which indeed show a substantial speedup:
dt <- data.frame(rep(c("002134", "024345", "176234"), 1000), rep(c("002003", "024234", "002004"), 1000))
colnames(dt) <- c("class1", "class2")
setDT(dt)
times <- microbenchmark(
startswithtest = dt[, startsWith(class2, substring(class1, 1, 3))],
lapplytest = dt[, do.call(`==`, lapply(.SD, substring, 1, 3)), .SDcols = c("class1", "class2")],
numerictest = dt[, as.numeric(class1)%/%1000 == as.numeric(class2)%/%1000],
functiontest = dt[, sameclass(subclass1 = class1, subclass2 = class2, classdepth = 3), by = seq_len(nrow(dt))],
stringitest = dt[, stringdist(a = stri_sub(dt$class1, to = 3), b = stri_sub(dt$class2, to = 3), method = "hamming")],
times = 50
)
times
expr min lq mean median uq max neval
startswithtest 312.501 356.901 593.4530 444.8515 737.301 1692.602 50
lapplytest 383.602 439.201 736.3512 522.7010 966.901 2259.601 50
numerictest 1763.100 1932.600 3229.6651 2399.7510 4153.201 8396.301 50
functiontest 45677.700 61124.002 81567.9409 77844.5510 100084.901 133921.502 50
stringitest 794.201 1028.200 1423.5289 1259.6005 1739.400 3640.701 50
I will go with the startsWith for now, since it seems to offer the highest speed (unfortunately I was not able to use the C-function due to restrictions on my server). Thank you for the help!