I've been using gsub
extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's a fully reproducible code:
library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)
for (i in rpt) {
n = n + 1
print(n)
patt = paste(rep("a", rpt[n]), collapse = "")
#time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
msecFF[n] = mean(timeFF$time)/1000000.
timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
msecFT[n] = mean(timeFT$time)/1000000.
}
library(ggplot2)
library(grid)
library(gridExtra)
axis(1,at=seq(0,1000,200),labels=T)
p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)
As you see, I'm looking for a pattern that contains a
replicated rpt[n]
times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T
and 600 characters with fixed=F
and then the slope seems to be approximately as before (see plot below).
I suppose, it is due to memory, object size, etc. I also noticed that the longest allowed pattern
is 1463 symbols, with object size of 1552 bytes.
Can someone explain the kink better and why at 300 and 600 characters?
Added: it is worth mentioning, that most of my patterns are 5-10 characters long, which gives me on my real data (not the mock-up inp
in the example above) the following timing.
gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern
(I have 4k patterns, so total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns)
stringi
equivalentstringiF <- microbenchmark(stri_replace_all_fixed(str = inp, pattern = patt, replacement = "b"), times = 10)
;mean_stringiF[n] <- mean(stringiF$time)/1000000
;qplot(rpt, mean_stringiF)
, showed no similar increase with pattern length, at least not over the range tested here. – Locativeperl=TRUE
ingsub
withfixed=FALSE
I also see no slope (doesn't matter forfixed = TRUE
). However, I do need to usefixed = TRUE
as it is substantially faster in my application – Schramstri_replace
function execution time jumps at 5 :) Because for patterns longer or equal to 5, the KMP algorithm is used. For patterns shorter than 5 plain naive search is done. – Kocher