Using unicodedata.normalize in Python 2.7

Once again, I am very confused with a unicode question. I can't figure out how to successfully use unicodedata.normalize to convert non-ASCII characters as expected. For instance, I want to convert the string

u"Cœur"

To

u"Coeur"

I am pretty sure that unicodedata.normalize is the way to do this, but I can't get it to work. It just leaves the string unchanged.

>>> s = u"Cœur"
>>> unicodedata.normalize('NFKD', s) == s
True

What am I doing wrong?

Argentiferous answered 17/10, 2012 at 22:57 Comment(0)

Your problem is not really with Python: the character you are trying to decompose (u'\u0153', 'œ') is not a composed character in the first place, so there is nothing for the normalizer to decompose.

Check that your code does work with a string containing ordinary composed characters such as "ç" and "ã":

>>> import unicodedata
>>> a = u"maçã"
>>> for norm in ('NFC', 'NFKC', 'NFD', 'NFKD'):
...     b = unicodedata.normalize(norm, a)
...     print b, len(b)
... 
maçã 4
maçã 4
maçã 6
maçã 6

Then, if you check the Unicode reference pages for both characters (yours and c with cedilla), you will see that the latter has a "decomposition" specification that the former lacks:

http://www.fileformat.info/info/unicode/char/153/index.htm
http://www.fileformat.info/info/unicode/char/00e7/index.htm
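
You can also confirm this from Python itself: unicodedata.decomposition returns a character's decomposition mapping as a string, or an empty string when there is none.

>>> import unicodedata
>>> unicodedata.decomposition(u'\u00e7')   # ç decomposes to "c" + combining cedilla
'0063 0327'
>>> unicodedata.decomposition(u'\u0153')   # œ has no decomposition mapping at all
''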

It like "œ" is not formally equivalent to "oe" - (at least not for the people who defined this unicode part) - so, the way to go to normalize text containing this is to make a manual replacement of the char for the sequence with unicode.replace - as hacky as it sounds.

Claudy answered 17/10, 2012 at 23:32 Comment(1)
Actually, I'm not sure unicodedata.normalize was what I wanted. But I did figure out a workaround. - Argentiferous

You could try Unidecode:

# -*- coding: utf-8 -*-
from unidecode import unidecode # $ pip install unidecode

print(unidecode(u"Cœur"))
# -> Coeur
Stapes answered 18/10, 2012 at 4:28 Comment(1)
This is a great answer. Short and simple and gets the job done. - Marchpane

As jsbueno says, some letters just don't have a compatibility decomposition.

You can use the Unicode CLDR Latin-ASCII transform to generate a mapping of manual replacements.
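
For example, one way to apply that transform from Python is through the third-party PyICU bindings (an assumption on my part - any ICU wrapper that exposes Transliterator should work much the same way):

>>> import icu  # $ pip install PyICU
>>> tr = icu.Transliterator.createInstance('Latin-ASCII')
>>> print tr.transliterate(u"Cœur")
Coeur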

Memnon answered 18/10, 2012 at 4:10 Comment(0)
