Python not sorting unicode properly. Strcoll doesn't help

I've got a problem with sorting lists using unicode collation in Python 2.5.1 and 2.6.5 on OSX, as well as on Linux.

import locale   
locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]

Which should print:

[u'a', u'ą', u'z']

But instead prints out:

[u'a', u'z', u'ą']

To sum up, it looks as if strcoll is broken. I tried it with various types of variables (e.g. non-unicode encoded strings).

What am I doing wrong?

Best regards, Tomasz Kopczuk.

Conoid answered 5/8, 2010 at 8:25 Comment(5)
What does locale.getlocale(LC_COLLATE) return after your setlocale line?Prodrome
The locale module uses the locale API from the C library, so if there is an error it must be in the C library. An equivalent test with locale de_DE.UTF-8 and string ä instead of ą works correctly. Even if I use the German locale with ą the order is correct, so there must be something wrong with the Polish locale implementation in the C library. As a workaround you can convert the string to normalization form D using unicodedata.normalize, then even the naive strcmp ordering should work.Clausen
OK, I'm interested in this too. I tried it with pl_PL.UTF-8 and de_DE.UTF-8, and also with sort(key=locale.strxfrm) instead of strcoll, also on OS X, and for the moment I am getting your incorrect result. String ä with de_DE.UTF8 did not work for me.Bisulcate
Works for me on Linux but not Mac. Maybe OS X's collation tables are wrong, or something? FWIW POSIX locales are dodgy for webapps as they're per-process, not thread safe.Veneer
+1 Works for me on Linux (Ubuntu) but neither on Mac nor FreeBSD.Caddric
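The normalization workaround suggested in the comments above can be sketched as follows (Python 3). This is only an approximation, not real locale collation: it relies on the decomposed form placing the base letter first, so it ignores case and any language-specific ordering rules. The helper name nfd_key is made up for illustration.

```python
import unicodedata

def nfd_key(text):
    # Hypothetical helper: decompose accented letters into base letter
    # plus combining mark, so 'ą' becomes 'a' followed by U+0328.
    return unicodedata.normalize('NFD', text)

words = ['a', 'z', 'ą']
# Plain code point comparison on the decomposed strings.
result = sorted(words, key=nfd_key)
print(result)  # ['a', 'ą', 'z']
```

Because the comparison is still by code point, uppercase letters sort before all lowercase ones, so this is only useful as a stopgap where no working collation is available.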
C
18

Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).

On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh the joy of using macports).

PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]

which gives:

[u'a', u'ą', u'z']

Another advantage, @bobince: it's thread-safe, so it remains usable when setting per-request locales.

Conoid answered 5/8, 2010 at 9:37 Comment(3)
Good question, and good answer -- and you're ahead of everyone by a few steps, which is no wonder if you're in Poland :) . Anyhow, this is the second time I've seen issues with Python where it relies on underlying C libraries. Do you know where these could be brought up?Bisulcate
I think it might be a problem with the libraries themselves, rather than Python. But as gnibbler pointed out, it happens to work on some OSes, so maybe this particular issue, at least, has been fixed at some point. OS X is famous for using an old gcc and so on, and the other OS I tested was Fedora 8, which itself is not quite contemporary. I would bring this up on one of the mailing lists for the underlying C libraries. Cheers mate :)Conoid
I agree. I made a Gist gist.github.com/509520 and will give it to a few people to try out. I love i18n, but the bugs make it tedious.Bisulcate
B
7

Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.

EDIT: Still on the same system, I find that for the UTF-8 versions of de_DE, fr_FR, pl_PL, the problem is there, but for the ISO-8859-1 versions of fr_FR and de_DE, sort order is correct. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:

The order for Polish ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.

The order for Polish Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, UTF8.

The order for German Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH DIAERESIS
The LC_COLLATE culture and encoding settings were de_DE, UTF8.

The order for German ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER A WITH DIAERESIS
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.

The order for French ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.

The order for French Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER E WITH ACUTE
The LC_COLLATE culture and encoding settings were fr_FR, UTF8.
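
A rough Python 3 counterpart of this kind of test, as a sketch only: it tries each UTF-8 locale in turn and prints the resulting order, skipping locales that are not installed. The function name sorted_in_locale is made up for illustration, and since Python 3's locale.strxfrm works on wide characters, the results are not guaranteed to match a byte-oriented C strcoll() test exactly.

```python
import locale

def sorted_in_locale(words, locale_name):
    """Hypothetical helper: sort words under locale_name, or return None."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale is not installed on this system
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # Always put the previous collation locale back.
        locale.setlocale(locale.LC_COLLATE, saved)

for name in ('pl_PL.UTF-8', 'de_DE.UTF-8', 'fr_FR.UTF-8'):
    print(name, sorted_in_locale(['a', 'z', 'ą', 'ä', 'é'], name))
```

No expected output is given here on purpose: as this thread shows, it depends on the C library and its collation tables.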
Bisulcate answered 5/8, 2010 at 23:9 Comment(5)
Is it possible to decompile /usr/share/locale/pl_PL.UTF-8/LC_COLLATE to some sort of readable form? Might not be a gcc bug after all, but wrong collation tables, as @Veneer pointed out.Conoid
Well, I get the same behaviour for German and French (ie, characters with diacritics are sorted after "z"), so it's not just the Polish collation tables. I wonder if it doesn't just pick C locale or maybe the default locale (mine is en_GB -- is yours pl_PL?). In any event, it's clearly in the C library, whether in the data or in the code I can't tell.Bisulcate
Yup, mine is pl_PL. But it would be nice to check the collation tables and if they're kosher, then there's the problem with different locale settings being used by the library. But I guess it's the library, hence the problems on various OSes.Conoid
I don't know how the platform-specific collation tables are made, except that they're supposed to be made from the Common Locale Repository cldr.unicode.org . The more I look into this, the more I think the C library is a very minimal way to account for locale anyway, and that you're better off using ICU for serious work. See the additional testing above: the de_DE and fr_FR ISO locales are OK, but pl_PL is also buggy for ISO.Bisulcate
This problem seems to apply to the other German locales as well – i.e. de_AT, de_CH in addition to de_DE – in both their "standalone" and UTF-8 versions. ISO8859-1, ISO8859-15 seem fine. Operating system: OS X 10.10.5 (Yosemite)Autoplasty
S
5

Here is how I managed to sort Persian correctly (without PyICU), using Python 3.x:

First set the locale (don't forget to import locale and platform)

import locale
import platform

if platform.system() == 'Linux':
    locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')
elif platform.system() == 'Windows':
    locale.setlocale(locale.LC_ALL, 'Persian_Iran.1256')
else:
    pass  # set the locale name for any other OS here

Then sort using key:

a = ['ا','ب','پ','ت','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف','ق','ک','گ','ل','م','ن','و','ه','ي']

print(sorted(a,key=locale.strxfrm))

For list of Objects:

a = [{'id':"ا"},{'id':"ب"},{'id':"پ"},{'id':"ت"},{'id':"ث"},{'id':"ج"},{'id':"چ"},{'id':"ح"},{'id':"خ"},{'id':"د"},{'id':"ذ"},{'id':"ر"},{'id':"ز"},{'id':"ژ"},{'id':"س"},{'id':"ش"},{'id':"ص"},{'id':"ض"},{'id':"ط"},{'id':"ظ"},{'id':"ع"},{'id':"غ"},{'id':"ف"},{'id':"ق"},{'id':"ک"},{'id':"گ"},{'id':"ل"},{'id':"م"},{'id':"ن"},{'id':"و"},{'id':"ه"},{'id':"ي"}]

print(sorted(a, key=lambda x: locale.strxfrm(x['id'])))

Finally, you can reset the locale to the user's default:

locale.setlocale(locale.LC_ALL, '')
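
If you would rather restore whatever locale was active before, instead of the user's default, a small context manager is one option. This is a minimal sketch, not part of the locale module: the name collation_locale is made up, and it only touches LC_COLLATE rather than LC_ALL. Keep in mind that setlocale is process-wide, so this is not safe to use from several threads at once.

```python
import locale
from contextlib import contextmanager

@contextmanager
def collation_locale(name):
    # Hypothetical helper: switch LC_COLLATE for the duration of the block.
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    locale.setlocale(locale.LC_COLLATE, name)    # raises locale.Error if missing
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

before = locale.setlocale(locale.LC_COLLATE)
# 'C' is used here only because it exists everywhere; use e.g. 'fa_IR.UTF-8'.
with collation_locale('C'):
    ordered = sorted(['b', 'a', 'c'], key=locale.strxfrm)
after = locale.setlocale(locale.LC_COLLATE)
```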
Sharlasharleen answered 25/5, 2016 at 14:47 Comment(0)
H
4

@gnibbler, using PyICU with the sorted() function does work in a Python3 Environment. After a little digging through the ICU API documentation and some experimentation, I came across the getSortKey() function:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('de_DE.UTF-8'))
sorted(['a','b','c','ä'],key=collator.getSortKey)

which produces the desired collation:

['a', 'ä', 'b', 'c']

instead of the undesired collation:

sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
Hemimorphite answered 22/5, 2013 at 20:49 Comment(0)
S
2
import locale
from functools import cmp_to_key
iterable = [u'a', u'z', u'ą']
sorted(iterable, key=cmp_to_key(locale.strcoll))  # locale-aware sort order

(Ref.: http://docs.python.org/3.3/library/functools.html)

Si answered 10/7, 2013 at 15:12 Comment(0)
T
2

Since 2012 there has been a library called natsort. It includes amazing functions such as natsorted and humansorted. More importantly, they work not only with lists. Code:

from natsort import natsorted, humansorted

lst = [u"a", u"z", u"ą"]
dct = {"ą": 1, "ż": 3, "Ż": 4, "b": 5}

lst_natsorted = natsorted(lst)
lst_humansorted = humansorted(lst)
dct_natsorted = dict(natsorted(dct.items()))
dct_humansorted = dict(humansorted(dct.items()))

print("List natsorted: ", lst_natsorted)
print("List humansorted: ", lst_humansorted, "\n")
print("Dictionary natsorted: ", dct_natsorted)
print("Dictionary humansorted: ", dct_humansorted)

Output:

List natsorted:  ['a', 'ą', 'z']
List humansorted:  ['a', 'ą', 'z']

Dictionary natsorted:  {'Ż': 4, 'ą': 1, 'b': 5, 'ż': 3}  
Dictionary humansorted:  {'ą': 1, 'b': 5, 'ż': 3, 'Ż': 4}

As you can see, the results differ when sorting dictionaries, but for the given list both results are correct.

By the way, this library is also great for sorting strings containing numbers:

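from natsort import natsorted, humansorted

lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]

mixed_sorted = sorted(lst_mixed)
mixed_natsorted = natsorted(lst_mixed)
mixed_humansorted = humansorted(lst_mixed)

print("List with mixed strings sorted: ", mixed_sorted)
print("List with mixed strings natsorted: ", mixed_natsorted)
print("List with mixed strings humansorted: ", mixed_humansorted)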

Output:

List with mixed strings sorted:  ['a1', 'a10', 'a9', 'c4', 'c40', 'c5']
List with mixed strings natsorted:  ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
List with mixed strings humansorted:  ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
Tenotomy answered 17/11, 2021 at 17:4 Comment(0)
A
0

On Ubuntu Lucid the sorting with cmp seems to work OK, but my output encoding is wrong.

>>> import locale   
>>> locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
'pl_PL.UTF-8'
>>> print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]
[u'a', u'\u0105', u'z']

Using key with locale.strxfrm does not work, unless I am missing something:

>>> print [i for i in sorted([u'a', u'z', u'ą'], key=locale.strxfrm)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 0: ordinal not in range(128)
Anticline answered 5/8, 2010 at 9:33 Comment(3)
With strxfrm you have to manually encode the unicode string, AFAIK.Conoid
@tkopczuk, It would be nice to find a way to sort using key as cmp for sorted is gone in Python3Anticline
It seems to be working fine with the provided functools.cmp_to_key function (from functools import cmp_to_key), like that: sorted([u'a', u'z', u'ą'], key=cmp_to_key(collator.compare))Conoid
A
0

An old question, but some clarifications are required. For locale-sensitive sorting in Python, two approaches are available. Which approach you take depends on what operating system you are using.

The first approach is to use the built-in locale module. This will depend on what operating system you are on, and what locales are available.

import locale
locale.setlocale(locale.LC_COLLATE, 'pl_PL.UTF-8')
test_list = ['a', 'z', 'ą']
sorted(test_list, key=locale.strxfrm)

If I am using a version of Linux using glibc, I will get ['a', 'ą', 'z'].

If I am using a version of Linux using Musl libc, or a Linux distro developed for embedded systems, I will get ['a', 'z', 'ą'], i.e. locale sensitive sorting is unsupported.

If I am using a system based on BSD libc (like macOS), I will get ['a', 'z', 'ą'].

On macOS, if you run the following command:

ls -al  /usr/share/locale/pl_PL/LC_COLLATE

you get /usr/share/locale/pl_PL/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE, i.e. the Polish collation table is symlinked to another collation table, creating a language-insensitive sort. This is similar to other BSD libc derived systems, where priority was given to stable, locale-independent sorting in the filesystem.

The second approach, for systems with icu4c installed, is to use PyICU. ICU4C uses the Common Locale Data Repository (CLDR). CLDR locale data is more extensive than locale data in libc based implementations.

import icu
collator = icu.Collator.createInstance(icu.Locale('pl'))
sorted(test_list, key=collator.getSortKey)

Which gives ['a', 'ą', 'z'].

Locale data varies across implementations. This doesn't just affect sorting; it can be seen in other locale-sensitive operations as well.
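
If the same code has to run on several platforms, the two approaches can be combined. The following is a minimal sketch under stated assumptions: make_sort_key is a made-up name, PyICU is treated as an optional dependency, and the libc fallback changes the process-wide LC_COLLATE setting as a side effect. If neither ICU nor the requested locale is available, it falls back to plain code point order.

```python
import locale

def make_sort_key(icu_locale='pl', libc_locale='pl_PL.UTF-8'):
    """Hypothetical helper: return the best available sort key function."""
    try:
        import icu  # PyICU, optional third-party package
    except ImportError:
        icu = None
    if icu is not None:
        collator = icu.Collator.createInstance(icu.Locale(icu_locale))
        return collator.getSortKey
    try:
        # Side effect: changes LC_COLLATE for the whole process.
        locale.setlocale(locale.LC_COLLATE, libc_locale)
    except locale.Error:
        return lambda s: s  # locale not installed: code point order
    return locale.strxfrm

test_list = ['a', 'z', 'ą']
result = sorted(test_list, key=make_sort_key())
print(result)
```

With PyICU installed, or on glibc with the Polish locale generated, this should give ['a', 'ą', 'z']; on the other systems described above the accented letter may still end up last.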

Altarpiece answered 24/9, 2023 at 4:44 Comment(0)
