Why am I seeing different results for these two nearly identical Ruby regex patterns, and why is one matching what I think it shouldn't?

Using Ruby 1.9.2, I have the following Ruby code in IRB:

> r1 = /^(?=.*[\d])(?=.*[\W]).{8,20}$/i
> r2 = /^(?=.*\d)(?=.*\W).{8,20}$/i
> a = ["password", "1password", "password1", "pass1word", "password 1"]
> a.each {|p| puts "r1: #{r1.match(p) ? "+" : "-"} \"#{p}\"".ljust(25) + "r2: #{r2.match(p) ? "+" : "-"} \"#{p}\""}

This results in the following output:

r1: - "password"         r2: - "password"
r1: + "1password"        r2: - "1password"
r1: + "password1"        r2: - "password1"
r1: + "pass1word"        r2: - "pass1word"
r1: + "password 1"       r2: + "password 1"

1.) Why do the results differ?

2.) Why would r1 match on strings 2, 3 and 4? Wouldn't the (?=.*[\W]) lookahead cause it to fail since there aren't any non-word characters in those examples?

Globeflower answered 26/11, 2012 at 21:4 Comment(4)
Could you please try to match /^(?=.*[\d])(?=.*([\W])).{8,20}$/i and tell us what is captured in capturing group 1? (I'm afraid it's the digit, but you never know) - Botello
Results using Ruby 1.9.3-p327:
r1: - "password" r2: - "password"
r1: - "1password" r2: - "1password"
r1: - "password1" r2: - "password1"
r1: - "pass1word" r2: - "pass1word"
r1: + "password 1" r2: + "password 1"
=> ["password", "1password", "password1", "pass1word", "password 1"]
Looks like you may have found a bug with 1.9.2? - Brewage
Could you include that in your question please (for the sake of proper formatting)? - Botello
@ilanberci, I'm still seeing the exact same results in 1.9.3-p327. gist.github.com/a839ed2b3efdc949b894 - Globeflower

This results from the interaction between two regex features and Unicode. \W matches all non-word characters, which include U+212A "KELVIN SIGN" and U+017F "LATIN SMALL LETTER LONG S" (ſ). The /i flag adds the lower case versions of both of these, which are the “normal” k and s characters (U+006B "LATIN SMALL LETTER K" and U+0073 "LATIN SMALL LETTER S").

So it’s the s in password that’s being interpreted as a non-word character in certain cases.
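
As a rough way to look at the two characters named above, the sketch below checks them against \W and against a case-insensitive ASCII letter. The constant and method names are made up for illustration, and it assumes \W is ASCII-based, as the answer implies. No output is claimed for the /i check, since that depends on how your Ruby version folds case.

```ruby
# Illustrative names only; none of these come from the original post.
LONG_S = "\u017F" # LATIN SMALL LETTER LONG S
KELVIN = "\u212A" # KELVIN SIGN

# True if the single character is matched by \W (no /i flag involved).
def non_word?(ch)
  !(/\W/ =~ ch).nil?
end

# True if the ASCII letter, matched case-insensitively, accepts the character.
def folds_to?(ascii_letter, ch)
  !(Regexp.new(ascii_letter, Regexp::IGNORECASE) =~ ch).nil?
end

# Print both checks for each character next to its code point.
[[LONG_S, "s"], [KELVIN, "k"]].each do |ch, ascii|
  puts format("U+%04X  \\W: %s  /%s/i: %s", ch.ord, non_word?(ch), ascii, folds_to?(ascii, ch))
end
```

If both columns come back true for a character, that character is a non-word character that also case-folds to a plain ASCII letter, which is the combination the answer describes.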

Note that this only seems to occur when the \W is in a character class (i.e. [\W]). Also, I can only reproduce this in irb; inside a standalone script it seems to work as expected.
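
To see which combination triggers it on a given installation, a minimal comparison could look like the following. It is only a sketch: the helper name and labels are hypothetical, and the results for the two /i variants are deliberately not stated here because they vary between Ruby versions and, per the note above, possibly between irb and a script.

```ruby
# Hypothetical helper: true if the pattern matches anywhere in the text.
def matches?(pattern, text)
  !pattern.match(text).nil?
end

# Labels are illustrative; the four patterns differ only in brackets and /i.
VARIANTS = {
  "[\\W] with /i"    => /[\W]/i,
  "\\W with /i"      => /\W/i,
  "[\\W] without /i" => /[\W]/,
  "\\W without /i"   => /\W/
}

# Test each variant against a plain ASCII "s".
REPORT = {}
VARIANTS.each { |label, re| REPORT[label] = matches?(re, "s") }

puts "Ruby #{RUBY_VERSION}"
REPORT.each { |label, hit| puts "#{label.ljust(18)} #{hit ? '+' : '-'}" }
```

A "+" next to either /i variant means that build treats the plain s as a non-word character; the two variants without /i should never match it.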

See the Ruby bug about this for more information.

Ionogen answered 26/11, 2012 at 22:14 Comment(4)
Good catch. Not that it matters, but the actual problem is not ß (which is folded to ss), but 017F - LATIN SMALL LETTER LONG S ſ (which is folded to a single s). - Hodgepodge
@Pumbaa80 Thanks, that makes more sense, I’ve updated the answer. I took ß from a different comment on the bug report. In this case ß would also match because of the double-s in password, but the actual match is a single s, so it’s probably ſ. - Ionogen
Wow, that's an interesting feature :) Thanks for the explanation and the link to the bug report. - Globeflower
Since my regex doesn't ultimately need to be case-insensitive, simply leaving that flag off results in the expected behavior. - Globeflower
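
For reference, the workaround from that last comment might be sketched as below: the r2 pattern from the question with the /i flag removed, run over the same five strings. The constant names are invented for the example.

```ruby
# Same pattern as r2 in the question, but without the /i flag.
STRICT = /^(?=.*\d)(?=.*\W).{8,20}$/

SAMPLES = ["password", "1password", "password1", "pass1word", "password 1"]

# Pair each sample with true/false depending on whether it passes.
RESULTS = SAMPLES.map { |p| [p, !STRICT.match(p).nil?] }

RESULTS.each { |p, ok| puts "#{ok ? '+' : '-'} #{p.inspect}" }
```

Without case folding in play, only "password 1" passes, because it is the only sample that contains both a digit and a non-word character (the space).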
