Fastest way to check if a string matches a regexp in ruby?
Asked Answered
H

7

123

What is the fastest way to check if a string matches a regular expression in Ruby?

My problem is that I have to "egrep" through a huge list of strings to find which are the ones that match a regexp that is given at runtime. I only care about whether the string matches the regexp, not where it matches, nor what the content of the matching groups is. I hope this assumption can be used to reduce the amount of time my code spend matching regexps.

I load the regexp with

pattern = Regexp.new(ptx).freeze

I have found that string =~ pattern is slightly faster than string.match(pattern).

Are there other tricks or shortcuts that can used to make this test even faster?

Halvaard answered 9/8, 2012 at 15:49 Comment(3)
If you don't care about the content of the matching groups, why do you have them? You can make the regex faster by converting them to non-capturing.Foetation
Since the regexp is provided at run-time, I assume it's unconstrained, in which case there may be internal references within the reg-exp to groupings, and therefore converting them to non-capturing by modifying the regexp could modify the result (unless you additionally check for internal references, but the problem becomes increasingly complex). I find it curious =~ would be faster than string.match.Disharmonious
what is the benefit of freezing the regexp here?Ascocarp
B
147

Starting with Ruby 2.4.0, you may use RegExp#match?:

pattern.match?(string)

Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, as it avoids object allocations performed by other methods such as Regexp#match and =~:

Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.

Berryman answered 16/3, 2017 at 12:30 Comment(2)
Thank you for the suggestion. I have updated the benchmark script and Regexp#match? is indeed at least 50% faster than the other alternatives.Halvaard
Here's a link to the Ruby docs for Regexp.match?: ruby-doc.org/3.2.0/Regexp.html#method-i-match-3FResponsive
H
82

This is a simple benchmark:

require 'benchmark'

"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=>   0.610000   0.000000   0.610000 (  0.578133)

"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=>   0.718000   0.000000   0.718000 (  0.750010)

irb(main):019:0> "test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=>   1.703000   0.000000   1.703000 (  1.578146)

So =~ is faster but it depends what you want to have as a returned value. If you just want to check if the text contains a regex or not use =~

Hexamethylenetetramine answered 9/8, 2012 at 17:3 Comment(5)
As I wrote, I already found that =~ is faster than match, with a less dramatic performance increase when operating on bigger regexps. What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.Halvaard
I think there is no other solutionsHexamethylenetetramine
What about !("test123" !~ /1/)?Twopiece
@MattDiPasquale, two times the inverse should not be faster than "test123" =~ /1/Hexamethylenetetramine
/1/.match?("test123") is faster than "test123" =~ /1/ if it is only to check if the text contains a regex or not.Homogeneity
H
47

This is the benchmark I have run after finding some articles around the net.

With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew), on previous versions, re =~ str seems to be fastest, although str =~ re is almost as fast.

#!/usr/bin/env ruby
require 'benchmark'

str = "aacaabc"
re = Regexp.new('a+b').freeze

N = 4_000_000

Benchmark.bm do |b|
    b.report("str.match re\t") { N.times { str.match re } }
    b.report("str =~ re\t")    { N.times { str =~ re } }
    b.report("str[re]  \t")    { N.times { str[re] } }
    b.report("re =~ str\t")    { N.times { re =~ str } }
    b.report("re.match str\t") { N.times { re.match str } }
    if re.respond_to?(:match?)
        b.report("re.match? str\t") { N.times { re.match? str } }
    end
end

Results MRI 1.9.3-o551:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         2.390000   0.000000   2.390000 (  2.397331)
str =~ re         2.450000   0.000000   2.450000 (  2.446893)
str[re]           2.940000   0.010000   2.950000 (  2.941666)
re.match str      3.620000   0.000000   3.620000 (  3.619922)
str.match re      4.180000   0.000000   4.180000 (  4.180083)

Results MRI 2.1.5:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         1.150000   0.000000   1.150000 (  1.144880)
str =~ re         1.160000   0.000000   1.160000 (  1.150691)
str[re]           1.330000   0.000000   1.330000 (  1.337064)
re.match str      2.250000   0.000000   2.250000 (  2.255142)
str.match re      2.270000   0.000000   2.270000 (  2.270948)

Results MRI 2.3.3 (there is a regression in regex matching, it seems):

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         3.540000   0.000000   3.540000 (  3.535881)
str =~ re         3.560000   0.000000   3.560000 (  3.560657)
str[re]           4.300000   0.000000   4.300000 (  4.299403)
re.match str      5.210000   0.010000   5.220000 (  5.213041)
str.match re      6.000000   0.000000   6.000000 (  6.000465)

Results MRI 2.4.0:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re.match? str     0.690000   0.010000   0.700000 (  0.682934)
re =~ str         1.040000   0.000000   1.040000 (  1.035863)
str =~ re         1.040000   0.000000   1.040000 (  1.042963)
str[re]           1.340000   0.000000   1.340000 (  1.339704)
re.match str      2.040000   0.000000   2.040000 (  2.046464)
str.match re      2.180000   0.000000   2.180000 (  2.174691)
Halvaard answered 10/8, 2012 at 19:41 Comment(2)
Just to add note, literal forms are faster than these. E.g. /a+b/ =~ str and str =~ /a+b/. It's valid even when iterating them through functions and I see this valid enough to consider as better than storing and freezing regular expressions on a variable. I tested my script with ruby 1.9.3p547, ruby 2.0.0p481 and ruby 2.1.4p265. It's possible that these improvements were made on later patches but I have no plan to test it with earlier versions/patches yet.Warm
I thought !(re !~ str) might be faster, but it's not.Twopiece
T
7

What about re === str (case compare)?

Since it evaluates to true or false and has no need for storing matches, returning match index and that stuff, I wonder if it would be an even faster way of matching than =~.


Ok, I tested this. =~ is still faster, even if you have multiple capture groups, however it is faster than the other options.

BTW, what good is freeze? I couldn't measure any performance boost from it.

Tuckie answered 10/5, 2013 at 21:23 Comment(1)
The effects of freeze won't show up in the results because it occurs prior to the benchmark loops, and acts on the pattern itself.Popliteal
M
5

Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.

'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false
Mcwhirter answered 9/8, 2012 at 15:56 Comment(2)
I cannot use string slicing because the regexp is provided at run-time and I do not have any control over that.Halvaard
You can use string slicing, just not slicing using a fixed-string. Use a variable instead of a string in quotes and it'd still work.Popliteal
P
3

What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.

Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.

The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.

The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though you can write tests that fool you by not being careful.

I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.

The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.

Popliteal answered 3/10, 2014 at 23:51 Comment(0)
H
1

To complete Wiktor Stribiżew and Dougui answers I would say that /regex/.match?("string") about as fast as "string".match?(/regex/).

Ruby 2.4.0 (10 000 000 ~2 sec)

2.4.0 > require 'benchmark'
 => true 
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
 => #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21> 
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
 => #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007> 

Ruby 2.6.2 (100 000 000 ~20 sec)

irb(main):001:0> require 'benchmark'
=> true
irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>

Note: times varies, sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/), the differences maybe only due to the machine activity.

Homogeneity answered 4/12, 2017 at 22:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.