How to handle Combining Diacritical Marks with UnicodeUtils? - McMap



Asked 26/5, 2014 at 15:51 Answered 6/12, 2014 at 8:41

Solved ruby unicode diacritics unicode-normalization phonetics


I am trying to insert spaces into a string of IPA characters, e.g. to turn ɔ̃wɔ̃tɨ into ɔ̃ w ɔ̃ t ɨ. Using split/join was my first thought:

s = "ɔ̃w̃ɔtɨ"
s.split('').join(' ') #=> ̃ ɔ w ̃ ɔ p t ɨ

As I discovered by examining the results, letters with diacritics are in fact encoded as two characters. After some research I found the UnicodeUtils module, and used the each_grapheme method:

UnicodeUtils.each_grapheme(s).to_a.join(' ') #=> ɔ ̃w ̃ɔ p t ɨ

This worked fine, except for the inverted breve mark. The code changes ̑a into ̑ a. I tried normalization (UnicodeUtils.nfc, UnicodeUtils.nfd), but to no avail. I don't know why the each_grapheme method has a problem with this particular diacritic mark, but I noticed that in gedit, the breve is also treated as a separate character, as opposed to tildes, accents etc. So my question is as follows: is there a straightforward method of normalization, i.e. turning the combination of Latin Small Letter A and Combining Inverted Breve into Latin Small Letter A With Inverted Breve?
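For reference, here is a minimal sketch of what NFC composition does with this pair. It assumes Ruby 2.2 or later, where String#unicode_normalize is built in (newer than the Ruby in use when the question was asked); UnicodeUtils.nfc from the question is expected to behave the same way, but that is not verified here. The variable names are illustrative only.

```ruby
# Assumes Ruby 2.2+ for String#unicode_normalize.

# Base letter first, then U+0311 COMBINING INVERTED BREVE.
decomposed = "\u0061\u0311"
composed   = decomposed.unicode_normalize(:nfc)

puts composed == "\u0203"   # true: the pair composes to U+0203
puts composed.length        # 1 code point after composition

# A mark placed BEFORE its base letter has nothing to compose with,
# so NFC returns the string unchanged.
misordered = "\u0311\u0061"
puts misordered.unicode_normalize(:nfc) == misordered   # true
```

If normalization appears to do nothing, the order of the code points in the input is the first thing to check.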

Puerility answered 26/5, 2014 at 15:51 Comment(5)

Are you sure you are using the right ȃ? I have no experience with these characters, but I copied the one from the Wikipedia page with code point U+0203, and when I ran puts UnicodeUtils.each_grapheme("ɔ̃ȃɨ").to_a.join(' ') it correctly output "ɔ̃ ȃ ɨ". – Lees 26/5, 2014 at 16:12

In fact, I am sure I am using the wrong ȃ. I read the input data from a file and have no control over its content. I would like to replace all occurrences of ȃ (U+0311 + U+0061) with the correct version (U+0203). Maybe it is possible to do this in a text editor, but I don't know how. – Puerility 26/5, 2014 at 16:51

You can replace it as follows: "̑a".gsub("\u0311\u0061", "\u0203"). The "̑a" at the start is the one made up from U+0311 and U+0061. The gsub will replace it with the single character, proper version (technically you can use sub but if you want to replace all occurrences in a larger text, use gsub). – Lees 26/5, 2014 at 16:56

Thank you for your help. Unfortunately I found more "standalone" diacritic marks in the text. There were too many combinations to use gsub, so I decided to simply remove all Combining Diacritical Marks. It is a rather lame workaround (and I lost some phonetic information), but I cannot think of anything better. Maybe there is a way to check whether a character is a combining one? – Puerility 26/5, 2014 at 18:23

Note the combining character should be after the base character. In your question you say ̑a is transformed into ̑ a, but the character should appear as ȃ, so I suspect your example has the combining mark before the base character, which would explain the behaviour you see. – Condensable 26/5, 2014 at 19:0
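The two points raised in the comments above (testing whether a character is combining, and marks that sit before their base letter) can be sketched together in Ruby. The helper names combining? and marks_after_base are hypothetical, and the reordering step rests on a strong assumption: that every mark in the input precedes its base letter. On input that is already correctly ordered it would attach marks to the wrong letter. String#unicode_normalize again assumes Ruby 2.2+.

```ruby
# True when the given single character is a Unicode mark (general category M).
def combining?(char)
  !!(char =~ /\A\p{M}\z/)
end

# ASSUMPTION: every mark in the text precedes its base letter.
# Moves each run of marks to just after the next non-mark character.
def marks_after_base(text)
  text.gsub(/(\p{M}+)(\P{M})/) { "#{$2}#{$1}" }
end

puts combining?("\u0311")   # true
puts combining?("a")        # false

# U+0311 followed by "a" becomes "a" followed by U+0311, which NFC can compose.
fixed = marks_after_base("\u0311\u0061").unicode_normalize(:nfc)
puts fixed == "\u0203"      # true
```

This keeps the phonetic information that deleting all combining marks would lose, provided the ordering assumption holds for the whole file.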


I understand your question concerns Ruby, but I suppose the problem is much the same in Python. A simple solution is to test for combining diacritical marks explicitly:

import unicodedata

s = "ɔ̃w̃ɔtɨ"
clusters = []
for char in s:
    # A combining mark is attached to the preceding base character.
    if unicodedata.combining(char) and clusters:
        clusters[-1] += char
    else:
        clusters.append(char)
print(" ".join(clusters))
# prints: ɔ̃ w̃ ɔ t ɨ
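A rough Ruby counterpart of the loop above, offered as a sketch rather than a tested drop-in: it groups each base character with the marks that follow it, using the \p{M} (mark) regex property. The function name space_out is hypothetical, and the input is written with \u escapes so the code point order is explicit.

```ruby
# Split into "base character plus following marks" and join with spaces.
# A run of marks with no base before it is kept as a separate item.
def space_out(text)
  text.scan(/\P{M}\p{M}*|\p{M}+/).join(" ")
end

# U+0254 is the open o, U+0303 the combining tilde, U+0268 the barred i.
puts space_out("\u0254\u0303w\u0254\u0303t\u0268")
```

On Ruby 2.5 or later, String#grapheme_clusters is a built-in alternative that follows the full Unicode segmentation rules.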

Vientiane answered 6/12, 2014 at 8:41 Comment(0)
