How to determine if a character is a Chinese character
B

2

10

How can I determine whether a character is a Chinese character using Ruby?

Barbuto answered 28/4, 2010 at 8:22 Comment(1)
They usually have more strokes than katakana or hiragana. And you're generally only supposed to use ruby on the more complex kanji ... wait a moment, is this Japanese.SE or stack overflow?Umbrageous
L
7

An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series - check the table of contents at the start of the article also)

I haven't worked with Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs . Also note that it's a unified system that includes Japanese and Korean characters (some characters are shared between them), so I'm not sure you can distinguish which are Chinese only.

I think you can check whether it's a CJK character by calling this with a string str and the index n of the character to test:

def check_char(str, n)
  # Decode the UTF-8 string into an array of Unicode code points
  char = str.unpack("U*")[n]
  # No character at index n
  return false if char.nil?
  # CJK Unified Ideographs (main block)
  return true if char >= 0x4E00 && char <= 0x9FFF
  # Extension A
  return true if char >= 0x3400 && char <= 0x4DBF
  # Extension B
  return true if char >= 0x20000 && char <= 0x2A6DF
  # Extension C
  return true if char >= 0x2A700 && char <= 0x2B73F
  false
end
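For Ruby 1.9 or later, the same range check can be sketched as a lookup table. This is only an illustration: the names CJK_RANGES and cjk_ideograph? are made up here, and the ranges are simply the four blocks listed above, so later Unicode extension blocks are not covered.

```ruby
# encoding: utf-8

# Hypothetical constant: the four blocks checked in this answer,
# expressed as Ruby ranges of Unicode code points.
CJK_RANGES = [
  0x4E00..0x9FFF,   # CJK Unified Ideographs (main block)
  0x3400..0x4DBF,   # Extension A
  0x20000..0x2A6DF, # Extension B
  0x2A700..0x2B73F  # Extension C
].freeze

# Hypothetical helper: true if the first character of `char`
# falls inside one of the ranges above.
def cjk_ideograph?(char)
  return false if char.nil? || char.empty?
  code_point = char.ord # code point of the first character
  CJK_RANGES.any? { |range| range.cover?(code_point) }
end
```

With this sketch, cjk_ideograph?("漢") should be true, while kana such as "あ", hangul such as "한", and ASCII letters should all be false, which matches the behaviour reported in the comments.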
Louisiana answered 28/4, 2010 at 9:48 Comment(6)
@sam they are the CJK ranges. These are the Chinese, Japanese, and Korean characters (assuming the ranges are correct, which I believe they are)Athematic
@Michael Lowman, they returned false for the few Korean and Japanese characters I tested and for all of 1..9 and a..z, and they do return true for Chinese. How could I go about checking whether a character is traditional or simplified?Bulwerlytton
Also, where did these ranges come from? Unihan? Which specific page?Bulwerlytton
On the mentioned wikipedia page each of the blocks has a list of charts with the characters it contains. I used those ranges.Louisiana
Is it possible to distinguish between traditional and simplified forms?Bulwerlytton
Not very easily, but this library does the trick. Ruby 1.9+ only. github.com/jpatokal/script_detectorRodrigo
K
18

Ruby 1.9

# encoding: utf-8
"漢" =~ /\p{Han}/  # => 0 (index of the match); nil if the string has no Han character
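Building on that one-liner, here is a rough sketch of a few helpers around \p{Han}. The method names are invented for illustration. Note that \p{Han} matches by Unicode script, so it may also accept a few non-ideograph characters assigned to the Han script (for example the iteration mark 々), and it cannot tell Chinese apart from Japanese kanji or Korean hanja.

```ruby
# encoding: utf-8

# Hypothetical helper: true if `char` is exactly one Han-script character.
def han_char?(char)
  !!(char =~ /\A\p{Han}\z/)
end

# Hypothetical helper: true if `text` contains at least one Han-script character.
def contains_han?(text)
  !!(text =~ /\p{Han}/)
end

# Hypothetical helper: every Han-script character in `text`, in order.
def han_chars(text)
  text.scan(/\p{Han}/)
end
```

For example, han_chars("漢字とかな") should return ["漢", "字"], because the hiragana characters belong to a different script.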
Kat answered 28/4, 2010 at 8:37 Comment(4)
I used this code, but it doesn't work. This is the error: invalid character property name {Han}: /\p{Han}/Barbuto
@HelloWorld: Update your version of Ruby. All characters classes are documented now: github.com/ruby/ruby/blob/trunk/doc/re.rdoc (cool nick, BTW)Berfield
The link above is broken, but you can find all information in the ruby docs for regexp: ruby-doc.org/core-2.0.0/Regexp.html#label-Character+PropertiesPriestly
If you're getting "invalid character property name {Han}", you can sometimes solve this by adding /u: "漢" =~ /\p{Han}/uSteamroller
