How to specify Regexp for unicode cyrillic characters in Ruby 1.9

# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding               # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)   # finds all the Cyrillic characters
str2.gsub!(/\w/u, '')         # removes only the Latin characters
puts str2

The question is: why does \w ignore Cyrillic characters?

I have installed the latest Ruby package from http://rubyinstaller.org/. Here is the output of ruby -v:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

As far as I know, the Oniguruma regular expression library in Ruby 1.9 has full support for Unicode characters.

Photoemission answered 27/4, 2010 at 14:6 Comment(4)
On Linux (Ruby 1.9), gsub removes all characters: irb(main):006:0> str2.gsub(/\w/u,'') => "" - Pomposity
@aaz: It shouldn't (see my answer); perhaps you have an old version? - Labyrinthine
I would rename this question to "How to specify Regexp for unicode characters in Ruby 1.9", since it is not related to win32, nor only to Cyrillic. - Labyrinthine
You are right. It's a bug in Ruby 1.9.1p0; in Ruby 1.9.1p376 everything works well. - Pomposity

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII Unicode character.

You probably want to use [[:alnum:]] instead, which matches all Unicode alphabetic and numeric characters. See also [[:word:]] and [[:alpha:]].
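
To make the difference concrete, here is a minimal sketch comparing \w with the POSIX bracket classes on the string from the question. It assumes a Ruby version where \w is ASCII-only, as described above, and a UTF-8 source file; strip_matches is a hypothetical helper written for this illustration, not part of Ruby or of the original post.

```ruby
# coding: utf-8
# Sketch: \w versus POSIX bracket classes on a mixed Latin/Cyrillic string.
# Assumes a Ruby where \w is ASCII-only and the source encoding is UTF-8.

sample = "asdfМикимаус"

# Hypothetical helper (illustration only): return a copy of +text+ with
# every character matched by +pattern+ removed.
def strip_matches(text, pattern)
  text.gsub(pattern, '')
end

after_w     = strip_matches(sample, /\w/)           # ASCII word characters only
after_alnum = strip_matches(sample, /[[:alnum:]]/)  # Unicode letters and digits
after_word  = strip_matches(sample, /[[:word:]]/)   # Unicode word characters
cyrillic    = sample.scan(/\p{Cyrillic}/).join      # script-specific property

puts after_w            # prints "Микимаус": the Cyrillic letters survive \w
puts after_alnum.empty? # prints "true": [[:alnum:]] matched every character
puts after_word.empty?  # prints "true"
puts cyrillic           # prints "Микимаус"
```

If only one script should be matched, \p{Cyrillic} (already used in the question) is the narrower choice; [[:alnum:]] and [[:word:]] match letters from any script.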

Ankylose answered 27/4, 2010 at 17:26 Comment(1)
BTW, we can thank Run Paint Run Run for writing this documentation. - Labyrinthine
