Cyrillic strings Я̆ Я̄ Я̈ return length 2 instead of 1 in Ruby and other programming languages

Asked 15/1, 2018 at 22:57 Answered 16/1, 2018 at 9:26

Solved ruby-on-rails ruby string utf-8 unicode-normalization

In Ruby, JavaScript and Java (I didn't try others), the Cyrillic characters Я̆ Я̄ Я̈ have length 2. When I check the length of a string containing these characters, I get the wrong value.

"Я̈".mb_chars.length
#=> 2  # should be 1 (Ruby on Rails)

"Я̆".length
#=> 2  # should be 1 (Ruby, JavaScript)

"Ӭ".length
#=> 1  # correct (Ruby, JavaScript)

Please note that the strings are encoded in UTF-8 and each of these characters behaves as a single character.

My question is: why does this happen, and how can I get the correct length of a string that contains these characters?

Orison answered 15/1, 2018 at 22:57 Comment(5)

In your example I'm seeing "Я̈ " which has a space in it, same with the second example, but not the third. Check with "Я̈ ".chars which gives ["Я", "̈", " "] for me, the accent as a separate char. – Mingo 15/1, 2018 at 23:4

Thank you, but I think this is caused by the editor here on Stack Overflow. When you copy these characters into a terminal, browser, etc., it's just one character. – Orison 15/1, 2018 at 23:7

It displays correctly (shows one character) in my terminal, with the same result: 2.4.2 :002 > 'Я̆'.length => 2 – Ignescent 15/1, 2018 at 23:8

You might be interested in something like this #22277552, which I think is sort of related to your issue. – Ignescent 15/1, 2018 at 23:11

Actually, the referenced https://mcmap.net/q/63342/-how-does-zalgo-text-work might be better – Ignescent 15/1, 2018 at 23:11

The underlying problem is that Я̈ is actually two code points: the Я and the umlaut are separate:

'Я̈'.chars
#=> ["Я", "̈"]
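
To see exactly which code points are involved, it can help to write the string with explicit escapes rather than pasting it, since editors and browsers sometimes alter such text. A minimal sketch (the variable names are only illustrative):

```ruby
# "Я" (U+042F) followed by a combining diaeresis (U+0308), written with
# escapes so that no editor or browser can change the string.
ya_diaeresis = "\u042F\u0308"

# Render each code point as a "U+XXXX" label.
labels = ya_diaeresis.codepoints.map { |cp| format("U+%04X", cp) }

p labels                 # ["U+042F", "U+0308"]
p ya_diaeresis.length    # 2 (String#length counts code points)
p ya_diaeresis.bytesize  # 4 (each of the two code points takes 2 bytes in UTF-8)
```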

Normally you'd solve this sort of problem through Unicode normalization, but that alone won't help here because there is no single code point for Я̈ or Я̆ (though there is one for Ӭ).
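
The difference can be checked with String#unicode_normalize (available since Ruby 2.2). This is only a sketch of the point above, with the strings built from escapes and illustrative variable names: Э plus a combining diaeresis composes into Ӭ (U+04EC), while Я plus the same mark has nothing to compose into.

```ruby
# Э + combining diaeresis has a precomposed form (Ӭ, U+04EC),
# so NFC collapses the pair into a single code point.
e_decomposed = "\u042D\u0308"
e_composed = e_decomposed.unicode_normalize(:nfc)

# Я + combining diaeresis has no precomposed form,
# so NFC leaves both code points in place.
ya_decomposed = "\u042F\u0308"
ya_normalized = ya_decomposed.unicode_normalize(:nfc)

p e_decomposed.length   # 2
p e_composed.length     # 1
p ya_normalized.length  # 2
```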

You could strip off the diacritics before checking the length:

'Я̆'.gsub(/\p{Diacritic}/, '')
#=> "Я" 
'Я̆'.gsub(/\p{Diacritic}/, '').length
#=> 1

You'll get the desired length, but the stripped string is no longer the same as the original. This also works on characters like Ӭ, which can be represented by a single code point:

'Ӭ'.length
#=> 1
'Ӭ'.gsub(/\p{Diacritic}/, '')
#=> "Ӭ" 
'Ӭ'.gsub(/\p{Diacritic}/, '').length
#=> 1
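
If the original string has to stay intact, one alternative (not part of the answer above, offered as a sketch) is to count matches of the regex escape \X, which in Ruby matches one extended grapheme cluster, that is, a base character together with its combining marks:

```ruby
# Я̆Я̄Я̈ written with escapes: Я + breve, Я + macron, Я + diaeresis.
text = "\u042F\u0306\u042F\u0304\u042F\u0308"

# Each \X match is one base character plus the marks attached to it.
clusters = text.scan(/\X/)

p clusters.length  # 3
p text.length      # 6 (code points; the string itself is not modified)
```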

Unicode is wonderful and awesome and solves many problems that used to plague us. Unfortunately, Unicode is also horrible and complicated because human languages and glyphs weren't exactly designed.

Karie answered 16/1, 2018 at 0:38 Comment(2)

What is the exact purpose of NFD before gsubbing the diacritics out, besides being an educational example? I believe the decomposition could be safely removed from the chain. – Maddox 16/1, 2018 at 5:20

@mudasobwa Mostly to remind me not to be thinking about two different things at once (only one of which was related to the question). Thanks for reminding me. – Karie 16/1, 2018 at 6:37

Ruby 2.5 adds String#each_grapheme_cluster:

'Я̆Я̄Я̈'.each_grapheme_cluster.to_a   #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.each_grapheme_cluster.count  #=> 3

Note that you can't use each_grapheme_cluster.size, which is equivalent to each_char.size, so both would return 6 in the above example. (That looks like a bug; I've just filed a bug report.)
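
Wrapped up as a helper, this might look like the sketch below. The name grapheme_length is hypothetical (it is not a Ruby method), and the fallback for Rubies older than 2.5 assumes the \X regex escape, which matches one extended grapheme cluster:

```ruby
# Hypothetical helper (not part of Ruby): counts user-perceived characters.
def grapheme_length(str)
  if str.respond_to?(:each_grapheme_cluster)
    # Ruby 2.5+: count walks the clusters, so the result does not depend
    # on what the enumerator's size reports.
    str.each_grapheme_cluster.count
  else
    # Older Rubies: count \X matches instead.
    str.scan(/\X/).length
  end
end

p grapheme_length("\u042F\u0306\u042F\u0304\u042F\u0308")  # 3 (Я̆Я̄Я̈)
p grapheme_length("\u04EC")                                # 1 (Ӭ)
```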

Fontana answered 16/1, 2018 at 7:55 Comment(0)

Try the unicode-display_width gem, which is built to give an exact answer to this question:

require "unicode/display_width"
Unicode::DisplayWidth.of "Я̈" #=> 1

Saucepan answered 16/1, 2018 at 9:26 Comment(0)
