How to specify Regexp for unicode cyrillic characters in Ruby 1.9

Asked 27/4, 2010 at 14:6 Answered 27/4, 2010 at 17:26

Solved ruby regex unicode encoding character-properties

# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding                # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)    # finds all Cyrillic characters
str2.gsub!(/\w/u, '')          # removes only the Latin characters
puts str2

Why does \w ignore Cyrillic characters?

I installed the latest Ruby package from http://rubyinstaller.org/. Here is the output of ruby -v:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

As far as I know, the Oniguruma regular expression library in Ruby 1.9 has full support for Unicode characters.

Photoemission answered 27/4, 2010 at 14:6 Comment(4)

On Linux (Ruby 1.9), gsub removes all characters: irb(main):006:0> str2.gsub(/\w/u,'') => "" – Pomposity 27/4, 2010 at 14:58

@aaz: it shouldn't (see my answer); probably you have an old version? – Labyrinthine 27/4, 2010 at 17:28

I would rename this question as "How to specify Regexp for unicode characters in Ruby 1.9", since this is not related to win32 nor to (only) cyrillic. – Labyrinthine 27/4, 2010 at 17:41

You are right. It's a bug in Ruby 1.9.1p0; in Ruby 1.9.1p376 everything works well. – Pomposity 27/4, 2010 at 20:20

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII character, Cyrillic included.

You probably want to use [[:alnum:]] instead, which matches all Unicode alphabetic and numeric characters. See also [[:word:]] and [[:alpha:]].
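As a minimal sketch of the difference, assuming a UTF-8 source file and the string from the question (the variable names below are only illustrative), the three patterns can be compared side by side:

```ruby
# coding: utf-8
str = "asdfМикимаус"

# \w is ASCII-only ([a-zA-Z0-9_]), so only the Latin letters are removed.
latin_removed = str.gsub(/\w/, '')

# [[:alnum:]] is Unicode-aware, so both Latin and Cyrillic letters are removed.
all_removed = str.gsub(/[[:alnum:]]/, '')

# \p{Cyrillic} targets a single script and leaves the Latin letters alone.
cyrillic_removed = str.gsub(/\p{Cyrillic}/, '')

puts latin_removed      # Микимаус
puts all_removed.empty? # true
puts cyrillic_removed   # asdf
```

If underscores should be treated as word characters too, [[:word:]] is probably the closer replacement for \w.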

Ankylose answered 27/4, 2010 at 17:26 Comment(1)

BTW, we can thank Run Paint Run Run for writing this documentation. – Labyrinthine 27/4, 2010 at 17:51
