I am trying to insert spaces into a string of IPA characters, e.g. to turn ɔ̃wɔ̃tɨ into ɔ̃ w ɔ̃ t ɨ. Using split/join was my first thought:
s = "ɔ̃w̃ɔtɨ"
s.split('').join(' ') #=> ̃ ɔ w ̃ ɔ p t ɨ
As I discovered by examining the results, letters with diacritics are in fact encoded as two characters. After some research I found the UnicodeUtils module, and used the each_grapheme method:
UnicodeUtils.each_grapheme(s) {|g| g + ' '} #=> ɔ ̃w ̃ɔ p t ɨ
This worked fine, except for the inverted breve mark. The code changes ̑a into ̑ a. I tried normalization (UnicodeUtils.nfc, UnicodeUtils.nfd), but to no avail. I don't know why the each_grapheme method has a problem with this particular diacritic mark, but I noticed that in gedit, the breve is also treated as a separate character, as opposed to tildes, accents etc. So my question is as follows: is there a straightforward method of normalization, i.e. turning the combination of Latin Small Letter A and Combining Inverted Breve into Latin Small Letter A With Inverted Breve?
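For comparison, here is a minimal sketch of the same idea using only Ruby's standard library instead of UnicodeUtils. It assumes Ruby 2.5 or later, where String#unicode_normalize and String#grapheme_clusters are built in; space_out is a hypothetical helper name chosen for illustration, and the strings are written with \u escapes so the code point order is explicit (base letter first, combining mark second).

```ruby
# Sketch using only the standard library (Ruby 2.5+ assumed).
# space_out is a hypothetical helper, not part of UnicodeUtils.
def space_out(text)
  # NFC composes a base letter followed by a combining mark into a
  # precomposed character where Unicode defines one
  # (a + U+0311 becomes U+0203).
  composed = text.unicode_normalize(:nfc)
  # grapheme_clusters keeps any remaining combining marks attached
  # to the base character they follow.
  composed.grapheme_clusters.join(' ')
end

# U+0254 (open o) + U+0303 (combining tilde), w, again, t, U+0268
s = "\u0254\u0303w\u0254\u0303t\u0268"
puts space_out(s)

# a followed by combining inverted breve collapses to one code point
puts space_out("a\u0311") == "\u0203"
```

This only helps when the combining mark comes after its base letter in the string; a mark stored before the letter is left untouched by normalization.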
puts UnicodeUtils.each_grapheme("ɔ̃ȃɨ").to_a.join(' ') it did output "ɔ̃ ȃ ɨ" correctly. – Lees

"̑a".gsub("\u0311\u0061", "\u0203"). The "̑a" at the start is the one made up from U+0311 and U+0061. The gsub will replace it with the single character, proper version (technically you can use sub, but if you want to replace all occurrences in a larger text, use gsub). – Lees

̑a is transformed into ̑ a, but the character should appear as ȃ, so I suspect your example has the combining mark before the base character, which would explain the behaviour you see. – Condensable
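One way to check the suspicion in the last comment is to print the code points of the string. The sketch below is only illustrative: describe is a hypothetical helper name, and it relies on the built-in String#codepoints, String#unicode_normalize and String#grapheme_clusters (Ruby 2.5 or later assumed) rather than on UnicodeUtils.

```ruby
# describe is a hypothetical helper that lists code points in order.
def describe(text)
  text.codepoints.map { |cp| format('U+%04X', cp) }
end

mark_first = "\u0311a"   # combining inverted breve BEFORE the letter
base_first = "a\u0311"   # letter first, then the combining mark

p describe(mark_first)
p describe(base_first)

# NFC composes only a base letter followed by its mark, so the
# mark-first string comes back unchanged.
p mark_first.unicode_normalize(:nfc) == mark_first
p base_first.unicode_normalize(:nfc) == "\u0203"

# With the mark first, mark and letter fall into separate clusters,
# which is why a space ends up between them.
p mark_first.grapheme_clusters.length
p base_first.grapheme_clusters.length
```

If the first listing shows U+0311 ahead of U+0061, the input data has the mark in the wrong position, and neither normalization nor grapheme splitting will join the two.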