In Unicode, letters with accents can be represented in two ways: as the precomposed accented letter itself, or as the combination of the bare letter plus a combining accent. For example, é (U+00E9) and e´ (U+0065 U+0301) are usually displayed in the same way.
R renders the following (version 3.0.2, Mac OS 10.7.5):
> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"
However, of course:
> "\u00e9" == "\u0065\u0301"
[1] FALSE
Is there a function in R which converts two-unicode-character letters into their one-character form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".
That would be extremely handy for processing large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin1 characters -- and are better handled by plot.
Thanks a lot in advance.
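For illustration, a minimal sketch of the kind of conversion being asked about, assuming the stringi package (not mentioned above) is available: stri_trans_nfc performs Unicode NFC normalization, which composes a base letter plus combining accent into the single-character form.
> library(stringi)
> stri_trans_nfc("\u0065\u0301")
[1] "é"
> stri_trans_nfc("\u0065\u0301") == "\u00e9"
[1] TRUE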
iconv("\u0065\u0301", to="ASCII//TRANSLIT")
givesNA
, whereasiconv(normalize_C("\u0065\u0301"), to="ASCII//TRANSLIT")
gives"'e"
; andplot
prints the labels better in form C). I will try to learn more about the pros and cons. – Mullin
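A runnable version of the comparison described in that comment, under the assumption that normalize_C (named only in the comment) is an NFC-normalization helper; stringi::stri_trans_nfc is used here as a stand-in, and the transliteration results are those reported by the commenter, which may vary by platform.
> library(stringi)
> normalize_C <- function(x) stri_trans_nfc(x)  # assumed NFC helper, standing in for the one referenced above
> iconv("\u0065\u0301", to = "ASCII//TRANSLIT")              # decomposed input: reported to give NA
> iconv(normalize_C("\u0065\u0301"), to = "ASCII//TRANSLIT") # composed input: reported to give "'e"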