Ruby: Checking for East Asian Width (Unicode)

Using Ruby, I have to output strings in a columnar format to the terminal. Something like this:

| row 1     | a string here     | etc
| row 2     | another string    | etc

I can do this fine with Latin UTF-8 characters using String#ljust and %s.

But a problem arises when the characters are Korean, Chinese, etc. The columns simply won't align when there are rows of English interspersed with rows containing Korean, etc.

How can I get column alignment here? Is there a way to output Asian characters in the equivalent of a fixed-width font? How about for documents that are meant to be displayed and edited in Vim?

Nod answered 13/1, 2011 at 15:53 Comment(6)
Using Vim, you have the 'guifontwide' setting that enables you to choose a double-width font for Asian text.Ogre
Your choice of words is very poor. Asia is a very big place with many countries, languages and writing systems.Gan
@Nod Is the problem only with Korean/Chinese/... (Asian) languages, or with any character whose len() is greater than 1 (for example, «)? If the latter, then to get the real length of the text, use len(split(str, '\zs')) instead of len(str) (on vim-7.2; use strwidth(str) on vim-7.3).Heatherheatherly
@ZyX: No, this is a problem with CJK full-width characters, not about non-ASCII characters like "«".Dipole
@dan: I'd suggest renaming your question to something like "Ruby: Checking for East Asian Width (Unicode)", so people will be able to find it on Google.Dipole
Thanks for that suggestion. Followed it.Nod
V
1

Late to the party, but hopefully still helpful: In Ruby, you can use the unicode-display_width gem to check for a string's east-asian-width:

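require 'unicode/display_width/string_ext' # current gem versions ship String#display_width here; early releases loaded it via 'unicode/display_width'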
"⚀".display_width #=> 1
'一'.display_width #=> 2
Vardar answered 11/12, 2013 at 8:45 Comment(0)
D
3

Your problem happens with CJK (Chinese/Japanese/Korean) full-width and wide characters (also scroll down for diagrams); those characters occupy two fixed-width cells. String#ljust and friends don't take this into account.

There is unicodedata.east_asian_width in Python, which would allow you to write your own width-aware ljust, but it doesn't seem to exist in Ruby. The best I've been able to find is this blog post: http://d.hatena.ne.jp/hush_puppy/20090227/1235740342 (machine translation). If you look at the output at the bottom of the original, it seems to do what you want, so maybe you can reuse some of the Ruby code.

Or if you're only printing full-width characters (i.e. you're not mixing half-width and full-width), you can be lazy and just use full-width forms of everything, including the spacing and the box drawing. Here are a few characters you can copy and paste:

  • | (full-width vertical bar)
  • 　 (full-width space)
  • － (full-width dash; does not get rendered nicely in my terminal font)
  • ー (another full-width dash)
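
For the width-aware ljust route, here is a rough pure-Ruby sketch. The WIDE_RANGES list and the helper names approx_width and ljust_display are my own illustrative assumptions; the ranges only approximate the Unicode East Asian Width table, so a maintained gem or the full data file will be more accurate.

```ruby
# Code point ranges treated as two cells wide.
# ASSUMPTION: a hand-picked approximation, not the full UAX #11 table.
WIDE_RANGES = [
  0x1100..0x115F,   # Hangul Jamo initial consonants
  0x2E80..0x303E,   # CJK radicals, symbols and punctuation
  0x3041..0x33FF,   # Hiragana, Katakana, CJK compatibility
  0x3400..0x4DBF,   # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF,   # CJK Unified Ideographs
  0xAC00..0xD7A3,   # Hangul syllables
  0xF900..0xFAFF,   # CJK compatibility ideographs
  0xFE30..0xFE4F,   # CJK compatibility forms
  0xFF01..0xFF60,   # Full-width forms
  0xFFE0..0xFFE6,   # Full-width signs
  0x20000..0x3FFFD  # Supplementary ideographic planes
].freeze

# Approximate number of terminal cells the string occupies.
def approx_width(str)
  str.each_char.sum { |ch| WIDE_RANGES.any? { |r| r.cover?(ch.ord) } ? 2 : 1 }
end

# Like String#ljust, but pads by display cells instead of character count.
# Assumes the pad string is a single one-cell character.
def ljust_display(str, width, pad = ' ')
  missing = width - approx_width(str)
  missing > 0 ? str + pad * missing : str
end

rows = [['row 1', 'a string here'], ['row 2', '한국어 문자열'], ['row 3', '中文字符串']]
rows.each do |label, text|
  puts "| #{ljust_display(label, 9)} | #{ljust_display(text, 17)} | etc"
end
```

Whether the columns actually line up still depends on the terminal and font rendering wide characters as exactly two cells.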
Dipole answered 13/1, 2011 at 19:22 Comment(3)
I tested how your strwidth function behaves with these characters and found that strwidth("|") returns 2, not 1. I don't know, however, how to check width in Ruby.Heatherheatherly
Reading your earlier comment, I guess you're referring to Vim 7.3's strwidth function? Then apparently it doesn't take full-width characters into account (I never tried to claim that, in case you got the impression ^^). The vertical bar I posted is definitely full-width, in any case.Dipole
@Jo Liss I actually said that it does take full-width characters into account (with normal bar or utf-8 table border it will return 1).Heatherheatherly
R
0

Late to the party, but you can try east_asian_width_simple.

require 'east_asian_width_simple'
eaw = EastAsianWidthSimple.new(File.open('EastAsianWidth.txt'))
eaw.string_width('台灣 No.1') # => 9
eaw.string_width('No code, no 🐞') # => 14

It aims to be fast and flexible.

Fast

east_asian_width_simple is faster than other pure Ruby implementations. Below is a comparison of relative time cost (1x is the fastest):

| Gem                          | Width Calculation | Property Lookup
| east_asian_width_simple      | 1x                | 1x
| east_asian_width v0.0.2      | 8.78x             | 4.57x
| reline v0.3.1                | 10.25x            | -
| unicode-display_width v2.1.0 | 4.45x             | -
| unicode-eaw v2.2.0           | -                 | 10.60x
| visual_width v0.0.6          | 2.03x             | -

Flexible

east_asian_width_simple is flexible in that it is decoupled from the East Asian Width property data file.

Unlike other gems, you update by downloading the latest property file from unicode.org instead of upgrading the gem.

For example, the latest draft version of the data file is v15.0.0d5, but no other gem can apply it without releasing a new gem version.
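
To illustrate the idea of keeping the property data separate from the code, here is a minimal sketch that parses lines in the EastAsianWidth.txt format and computes a width from them. This is not how east_asian_width_simple is implemented; SAMPLE_DATA, parse_width_table and string_width are hypothetical names, and the sample lines are a hand-picked excerpt standing in for the real file from unicode.org.

```ruby
# ASSUMPTION: a tiny excerpt in the EastAsianWidth.txt line format
# ("code point or range ; property # comment"), for illustration only.
SAMPLE_DATA = <<~'DATA'
  # EastAsianWidth sample (illustrative excerpt)
  0020;Na          # SPACE
  0041..005A;Na    # LATIN CAPITAL LETTER A..Z
  0061..007A;Na    # LATIN SMALL LETTER A..Z
  3000;F           # IDEOGRAPHIC SPACE
  4E00..9FFF;W     # CJK UNIFIED IDEOGRAPHS
  AC00..D7A3;W     # HANGUL SYLLABLES
  FF01..FF60;F     # FULLWIDTH FORMS
DATA

# Turns the data lines into [range, property] pairs.
# Accepts anything that responds to each_line (a String or an open File).
def parse_width_table(source)
  source.each_line.each_with_object([]) do |line, table|
    entry = line.split('#', 2).first.strip # drop trailing comment
    next if entry.empty?
    range, property = entry.split(';').map(&:strip)
    first, last = range.split('..').map { |hex| hex.to_i(16) }
    table << [(first..(last || first)), property]
  end
end

# Counts W (wide) and F (full-width) code points as 2 cells, everything
# else, including code points missing from the table, as 1 cell.
def string_width(table, str)
  str.each_codepoint.sum do |cp|
    _, property = table.find { |range, _| range.cover?(cp) }
    %w[W F].include?(property) ? 2 : 1
  end
end

table = parse_width_table(SAMPLE_DATA)
puts string_width(table, '台灣 No.1')
```

Swapping in a newer data file then only changes the input to parse_width_table. A linear scan with find is fine for a sketch, but a real implementation would want a faster lookup.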

Rammer answered 12/6, 2022 at 4:53 Comment(0)
