Why am I seeing different results for these two nearly identical Ruby regex patterns, and why is one matching what I think it shouldn't?

Asked 26/11, 2012 at 21:4 Answered 26/11, 2012 at 22:14

Solved ruby regex unicode character-class

Using Ruby 1.9.2, I have the following Ruby code in IRB:

> r1 = /^(?=.*[\d])(?=.*[\W]).{8,20}$/i
> r2 = /^(?=.*\d)(?=.*\W).{8,20}$/i
> a = ["password", "1password", "password1", "pass1word", "password 1"]
> a.each {|p| puts "r1: #{r1.match(p) ? "+" : "-"} \"#{p}\"".ljust(25) + "r2: #{r2.match(p) ? "+" : "-"} \"#{p}\""}

This results in the following output:

r1: - "password"         r2: - "password"
r1: + "1password"        r2: - "1password"
r1: + "password1"        r2: - "password1"
r1: + "pass1word"        r2: - "pass1word"
r1: + "password 1"       r2: + "password 1"

1.) Why do the results differ?

2.) Why would r1 match on strings 2, 3 and 4? Wouldn't the (?=.*[\W]) lookahead cause it to fail since there aren't any non-word characters in those examples?

Globeflower asked 26/11, 2012 at 21:4 Comment(4)

Could you please try to match /^(?=.*[\d])(?=.*([\W])).{8,20}$/i and tell us what is captured in capturing group 1? (I'm afraid it's the digit, but you never know) – Botello 26/11, 2012 at 21:22

Results using Ruby 1.9.3-p327: r1: - "password" r2: - "password" r1: - "1password" r2: - "1password" r1: - "password1" r2: - "password1" r1: - "pass1word" r2: - "pass1word" r1: + "password 1" r2: + "password 1" => ["password", "1password", "password1", "pass1word", "password 1"] Looks like you may have found a bug with 1.9.2? – Brewage 26/11, 2012 at 21:30

Could you include that in your question please (for the sake of proper formatting) – Botello 26/11, 2012 at 21:39

@ilanberci, I'm still seeing the same exact results in 1.9.3-p327. gist.github.com/a839ed2b3efdc949b894 – Globeflower 27/11, 2012 at 15:19

This results from the interaction between a couple of regex features and Unicode. \W matches every non-word character, which includes U+212A "KELVIN SIGN" (K) and U+017F "LATIN SMALL LETTER LONG S" (ſ). The /i flag then adds the lower case equivalents of both to the set of matched characters, and those are the “normal” k and s: U+006B "LATIN SMALL LETTER K" and U+0073 "LATIN SMALL LETTER S".

So it’s the s in password that’s being interpreted as a non-word character in certain cases.

Note that this only seems to occur when the \W is inside a character class (i.e. [\W]). Also, I can only reproduce this in irb; inside a standalone script it seems to work as expected.
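To poke at this yourself, here is a minimal sketch of the checks described above. The helper name matches? and the two constants are made up for illustration. The first loop only confirms that both characters count as non-word characters. The later lines print whatever your Ruby build does for the case-insensitive cases, which may differ between versions, so no output is claimed for them here.

```ruby
KELVIN_SIGN = "\u212A" # U+212A KELVIN SIGN
LONG_S      = "\u017F" # U+017F LATIN SMALL LETTER LONG S

# Hypothetical helper: true when the pattern matches anywhere in the string.
def matches?(pattern, string)
  !(pattern =~ string).nil?
end

# Both characters fall outside Ruby's \w, so \W matches them.
[KELVIN_SIGN, LONG_S].each do |ch|
  code = format("U+%04X", ch.ord)
  puts "#{code} matched by \\W: #{matches?(/\W/, ch)}"
end

# Case-insensitive comparison against the ordinary letters.
puts "KELVIN SIGN matched by /k/i: #{matches?(/k/i, KELVIN_SIGN)}"
puts "LONG S matched by /s/i:      #{matches?(/s/i, LONG_S)}"

# The combination from the question: a plain "s" against \W with /i,
# once bare and once inside a character class.
puts "'s' matched by /\\W/i:   #{matches?(/\W/i, 's')}"
puts "'s' matched by /[\\W]/i: #{matches?(/[\W]/i, 's')}"
```

If the last line prints true, your interpreter shows the same behavior as in the question.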

See the Ruby bug about this for more information.

Ionogen answered 26/11, 2012 at 22:14 Comment(4)

Good catch. Not that it matters, but the actual problem is not ß (which is folded to ss), but 017F - LATIN SMALL LETTER LONG S ſ (which is folded to a single s). – Hodgepodge 27/11, 2012 at 0:8

@Pumbaa80 Thanks, that makes more sense, I’ve updated the answer. I took ß from a different comment on the bug report. In this case ß would also match because of the double-s in password, but the actual match is a single s, so it’s probably ſ. – Ionogen 27/11, 2012 at 9:26

Wow, that's an interesting feature :) Thanks for the explanation and the link to the bug report. – Globeflower 27/11, 2012 at 13:17

Since my regex doesn't ultimately need to be case insensitive, simply leaving the /i flag off results in the expected behavior. – Globeflower 27/11, 2012 at 15:21
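For anyone landing here later, a minimal sketch of that workaround: the same pattern as r1 from the question with the /i flag removed. The names PASSWORD_RULE and acceptable? are made up for illustration. Neither \d nor \W depends on letter case, so dropping the flag does not change what the pattern is meant to accept.

```ruby
# Same pattern as r1 in the question, minus the /i flag.
PASSWORD_RULE = /^(?=.*[\d])(?=.*[\W]).{8,20}$/

# Hypothetical helper: true when the candidate satisfies the rule.
def acceptable?(candidate)
  !(PASSWORD_RULE =~ candidate).nil?
end

samples = ["password", "1password", "password1", "pass1word", "password 1"]
samples.each do |candidate|
  puts "#{acceptable?(candidate) ? '+' : '-'} #{candidate.inspect}"
end
```

With the flag gone, only "password 1" should be accepted, since it is the only sample containing both a digit and a non-word character.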
