gsub speed vs pattern length
Asked Answered
S

1

26

I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's a fully reproducible code:

library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)

for (i in rpt) {
  n = n + 1
  print(n)
  patt = paste(rep("a", rpt[n]), collapse = "")
  #time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
  msecFF[n] = mean(timeFF$time)/1000000.

  timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
  msecFT[n] = mean(timeFT$time)/1000000.
}

library(ggplot2)
library(grid)
library(gridExtra)

axis(1,at=seq(0,1000,200),labels=T)

p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)

As you see, I'm looking for a pattern that contains a replicated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T and 600 characters with fixed=F and then the slope seems to be approximately as before (see plot below). I suppose, it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 symbols, with object size of 1552 bytes.

Can someone explain the kink better and why at 300 and 600 characters?

gsub speed in milliseconds with fixed turned on/off as a function of pattern length in characters

Added: it is worth mentioning, that most of my patterns are 5-10 characters long, which gives me on my real data (not the mock-up inp in the example above) the following timing.

gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern

(I have 4k patterns, so total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns)

Real data timing

Schram answered 17/12, 2014 at 20:29 Comment(6)
Not at all an answer to your question, still it was interesting to see that a stringi equivalent stringiF <- microbenchmark(stri_replace_all_fixed(str = inp, pattern = patt, replacement = "b"), times = 10); mean_stringiF[n] <- mean(stringiF$time)/1000000; qplot(rpt, mean_stringiF), showed no similar increase with pattern length, at least not over the range tested here.Locative
If I use perl=TRUE in gsub with fixed=FALSE I also see no slope (doesn't matter for fixed = TRUE). However, I do need to use fixed = TRUE as it is substantially faster in my applicationSchram
I know why in stri_replace function execution time jumps at 5 :) Because for patterns longer or equal to 5, the KMP algorithm is used. For patterns shorter than 5 plain naive search is done.Kocher
Your plots would be cleaner if your domain started at n=2 instead of n=1. By cleaner I mean that the region of interest would own the majority of the space on the graph.Alienation
What's the question?Jodhpurs
Can someone explain the kink better and why at 300 and 600 characters?Schram
C
4

The kinks might be related to the bits required to hold patterns of that length.

There is another solution that scales much better, use the repetition operator {} to specify how many repeats you want to find. In order to find more than 255 (8 bit integer max) you'll have to specify perl = TRUE.

patt2 <- paste0('a{',rpt[n],'}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = T), times = 10)

I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.

Chiffon answered 5/2, 2015 at 20:42 Comment(2)
Thanks for the idea. Unfortunately, all my patterns are unique and typically just regular words with no repeats. For my application the fixed=FALSE provides the most correct (though slower than fixed=TRUE) solution due to other characters that might accompany the target (e.g. "caps," etc)Schram
Here's a live example. a = "appless, appless. apples- . I need to changes `apples' to 'apple', leaving all punctuation intact. I have another thread on it, why and what I need to do: #27368414Schram

© 2022 - 2024 — McMap. All rights reserved.