gsub speed vs pattern length - McMap

About

gsub speed vs pattern length

Asked 17/12, 2014 at 20:29 Answered 5/2, 2015 at 20:42

regex r string gsub stringi

S

1

26

I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's a fully reproducible code:

library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)

for (i in rpt) {
  n = n + 1
  print(n)
  patt = paste(rep("a", rpt[n]), collapse = "")
  #time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
  msecFF[n] = mean(timeFF$time)/1000000.

  timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
  msecFT[n] = mean(timeFT$time)/1000000.
}

library(ggplot2)
library(grid)
library(gridExtra)

axis(1,at=seq(0,1000,200),labels=T)

p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)

As you see, I'm looking for a pattern that contains a replicated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T and 600 characters with fixed=F and then the slope seems to be approximately as before (see plot below). I suppose, it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 symbols, with object size of 1552 bytes.

Can someone explain the kink better and why at 300 and 600 characters?

gsub speed in milliseconds with fixed turned on/off as a function of pattern length in characters

Added: it is worth mentioning, that most of my patterns are 5-10 characters long, which gives me on my real data (not the mock-up inp in the example above) the following timing.

gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern

(I have 4k patterns, so total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns)

Real data timing

Schram answered 17/12, 2014 at 20:29 Comment(6)

Not at all an answer to your question, still it was interesting to see that a stringi equivalent stringiF <- microbenchmark(stri_replace_all_fixed(str = inp, pattern = patt, replacement = "b"), times = 10); mean_stringiF[n] <- mean(stringiF$time)/1000000; qplot(rpt, mean_stringiF), showed no similar increase with pattern length, at least not over the range tested here. – Locative 17/12, 2014 at 21:10

If I use perl=TRUE in gsub with fixed=FALSE I also see no slope (doesn't matter for fixed = TRUE). However, I do need to use fixed = TRUE as it is substantially faster in my application – Schram 17/12, 2014 at 21:17

I know why in stri_replace function execution time jumps at 5 :) Because for patterns longer or equal to 5, the KMP algorithm is used. For patterns shorter than 5 plain naive search is done. – Kocher 11/1, 2015 at 16:55

Your plots would be cleaner if your domain started at n=2 instead of n=1. By cleaner I mean that the region of interest would own the majority of the space on the graph. – Alienation 22/1, 2015 at 15:47

What's the question? – Jodhpurs 23/1, 2015 at 13:58

Can someone explain the kink better and why at 300 and 600 characters? – Schram 23/1, 2015 at 17:1

C

4

The kinks might be related to the bits required to hold patterns of that length.

There is another solution that scales much better, use the repetition operator {} to specify how many repeats you want to find. In order to find more than 255 (8 bit integer max) you'll have to specify perl = TRUE.

patt2 <- paste0('a{',rpt[n],'}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = T), times = 10)

I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.

Chiffon answered 5/2, 2015 at 20:42 Comment(2)

Thanks for the idea. Unfortunately, all my patterns are unique and typically just regular words with no repeats. For my application the fixed=FALSE provides the most correct (though slower than fixed=TRUE) solution due to other characters that might accompany the target (e.g. "caps," etc) – Schram 5/2, 2015 at 20:59

Here's a live example. a = "appless, appless. apples- . I need to changes `apples' to 'apple', leaving all punctuation intact. I have another thread on it, why and what I need to do: #27368414 – Schram 5/2, 2015 at 21:9

Recommended topics

#Godot #Unity #Godot 4.X #Mongodb

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

© 2022 - 2024 — McMap. All rights reserved.