Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?


Asked 8/12, 2013 at 20:44 Answered 19/12, 2013 at 14:53

Solved r unicode encoding unicode-normalization latin

In Unicode, accented letters can be represented in two ways: as a single precomposed character, or as the bare letter followed by a combining accent. For example, é (U+00E9) and é (U+0065 U+0301) are usually displayed identically.

R renders the following (version 3.0.2, Mac OS 10.7.5):

> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"

However, of course:

> "\u00e9" == "\u0065\u0301"
[1] FALSE
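The difference shows up at the code point level. A minimal sketch, using base R's utf8ToInt (the variable names are only illustrative), to inspect what each string actually contains:

```r
# Precomposed form: a single code point, U+00E9
precomposed <- "\u00e9"
# Decomposed form: base letter U+0065 followed by combining acute accent U+0301
decomposed <- "\u0065\u0301"

# utf8ToInt() returns the code points of one UTF-8 string as integers
cp_precomposed <- utf8ToInt(precomposed)
cp_decomposed  <- utf8ToInt(decomposed)

print(cp_precomposed)  # 233
print(cp_decomposed)   # 101 769

# The two sequences differ, which is why == reports FALSE
print(identical(cp_precomposed, cp_decomposed))
```

So the two strings render alike but hold different code point sequences.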

Is there a function in R that converts such two-code-point letters into their single-code-point (precomposed) form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".

That would be extremely handy for processing large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv (at least for the usual Latin1 characters) and are better handled by plot.

Thanks a lot in advance.

Mullin answered 8/12, 2013 at 20:44 Comment(3)

You might want to post your edit as an answer. That way the question shows up as answered. Also, IIRC, you should convert to form D, not C, since the combined characters are a bit of a hack. – Scleroprotein 9/12, 2013 at 1:27

Thanks! You might be right about form D in general, though until now form C has seemed more adapted to my practice (e.g. iconv("\u0065\u0301", to="ASCII//TRANSLIT") gives NA, whereas iconv(normalize_C("\u0065\u0301"), to="ASCII//TRANSLIT") gives "'e" ; and plot prints the labels better in form C). I will try to learn more about the pros and cons. – Mullin 9/12, 2013 at 16:44

@Mullin you saved my day - thanks ! – Columbuscolumbyne 2/3, 2016 at 17:17

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> library(stringi)
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
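Since the original goal was to process many strings at once, here is a small sketch of the same normalization applied to a whole vector. The input vector `words` is made up for illustration; stri_trans_isnfc, stri_trans_nfc and stri_length are stringi functions.

```r
library(stringi)

# Hypothetical input: the same word in precomposed and decomposed spellings,
# plus a plain ASCII variant
words <- c("caf\u00e9", "cafe\u0301", "cafe")

# stri_trans_isnfc() reports which elements are already in form C
already_nfc <- stri_trans_isnfc(words)

# stri_trans_nfc() is vectorised, so the whole vector is normalized in one call
words_nfc <- stri_trans_nfc(words)

# After normalization the first two elements hold the same code points
same_after <- words_nfc[1] == words_nfc[2]

# Each normalized element now counts as 4 code points
n_chars <- stri_length(words_nfc)
```

Checking with stri_trans_isnfc first is optional; it is mainly useful for finding out how much of the input actually needed normalizing.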

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal;
# otherwise it returns 1 or -1, i.e. greater or lesser in the collation (alphabetical) order.
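When only a yes/no answer is needed, stringi also provides logical comparison functions. The sketch below contrasts stri_cmp_eq (exact code point equality) with stri_cmp_equiv (canonical equivalence through the collator). Treat it as an illustration and not as the recommended approach; the variable names are assumptions.

```r
library(stringi)

precomposed <- "\u00e9"        # single code point
decomposed  <- "\u0065\u0301"  # letter + combining accent

# Exact comparison: TRUE only if both strings have identical code points
exact_equal <- stri_cmp_eq(precomposed, decomposed)

# Canonical equivalence: treats the two spellings of the letter as the same
canon_equal <- stri_cmp_equiv(precomposed, decomposed)

# Normalizing first makes the exact comparison succeed as well
exact_after_nfc <- stri_cmp_eq(stri_trans_nfc(precomposed),
                               stri_trans_nfc(decomposed))
```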

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!

Mullin answered 19/12, 2013 at 14:53 Comment(1)

Since I found your answer when googling for umlaut replacement on German OS X in R via iconv(), it might be worth noting that with the stringi package it is simply one function: stringi::stri_trans_general(c("äöüø"),"latin-ascii") – Riffraff 12/1, 2016 at 14:12
