How do I sort unicode strings alphabetically in Python?
Asked Answered
F

11

114

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python?

Is there a library for this? I couldn't find anything. Preferrably sorting should have language support so it understands that åäö should be sorted after z in Swedish, but that ü should be sorted by u, etc. Unicode support is thereby pretty much a requirement.

If there is no library for it, what is the best way to do this? Just make a mapping from letter to a integer value and map the string to a integer list with that?

Fanning answered 8/7, 2009 at 12:59 Comment(4)
Note that this is even more locale dependent: In Swedish (as you state) "Ä" comes after after "Z", but in German, "Ä" is usually sorted as "AE".Proton
@Georg: Was there a reason you opened a bounty on this? The locale.strcoll answer is correct when you need Unicode sorting using the user's locale, and the ICU answer what you want when you need more than that (collation using more than one locale). Most of the time, you want locale.strcoll.Rosado
@Glenn: I wanted to know how well locale.strcoll works and especially what ICU does better than the Python function. Basically some more attention for the question.Stipulation
@Georg: I’ve been playing around a lot with the Unicode Collation Algorithm lately, as you can see from my answer. It is truly excellent to be able, for example, to sort on --locale=de__phonebook when you need it. The Perl module passes the UCA test suite, and the script I provided makes it a lot easier to play with the whole UCA plus all its options including locales, just from the command line. Might not answer the question, but it should still be highly interesting. If you’re in Switzerland, I am sure you could use the flexibility. :)Shantell
B
87

IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.

Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651.

The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.

>>> import icu # pip install PyICU
>>> sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
>>> collator = icu.Collator.createInstance(icu.Locale('de_DE.UTF-8'))
>>> sorted(['a','b','c','ä'], key=collator.getSortKey)
['a', 'ä', 'b', 'c']
Barbera answered 8/7, 2009 at 13:42 Comment(4)
Does this work the same for Python 2 and Python 3? I used locale.strxfrm from the answer by u0b34a0f6ae and it seems to work and is much more elegant and does not require any additional software.Magdamagdaia
Doesn't work with Python3 for me, sudo pip3 install PyICU fails to install and so does for Python2.Unsnarl
I had to install libicu-devel.x86_64 for pyICU to compile and install from Pip. It works, although the output from the last 'sorted' command is: ['a', '\xc3\xa4', 'b', 'c']Allanite
You will need to install libicu-devel and build-essentials (for Debian systems) so that pip installing "pyicu" works. This is because "pyicu" only offers source distributions (tarballs) in PyPI, so they need to be compiled to wheel files when installing it.Theis
J
59

I don't see this in the answers. My Application sorts according to the locale using python's standard library. It is pretty easy.

# python2.5 code below
# corpus is our unicode() strings collection as a list
corpus = [u"Art", u"Älg", u"Ved", u"Wasa"]

import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# alternatively, (but it's bad to hardcode)
# locale.setlocale(locale.LC_ALL, "sv_SE.UTF-8")

corpus.sort(cmp=locale.strcoll)

# in python2.x, locale.strxfrm is broken and does not work for unicode strings
# in python3.x however:
# corpus.sort(key=locale.strxfrm)

Question to Lennart and other answerers: Doesn't anyone know 'locale' or is it not up to this task?

Jassy answered 23/8, 2009 at 14:32 Comment(14)
By the way 1) I don't thinkn locale.strxfrm is broken for UTF-8 encoded `str'; I benchmarked by application and concluded that using cmp=strcoll on unicode objects is cheaper than decoding all to UTF-8 and using key=strxfrmJassy
By the way 2) The locale module will only work with your generated locales (for a Linux box), not any arbitrary locale. "locale -a" will tell you whichJassy
Can anyone confirm if this works? If yes, why are there so many answers using external libraries?Stipulation
@Georg: it's the Law of the Instrument, if you have a hammer at hand, every problem seems like a nail ;-) ... locale.strcoll should work fine.Headlong
@Georg: I believe that locale only supports a simple substring->collating_element mapping. It does not handle things like expansions (æ sorted as "ae"), French accent sorting (letters sorted left-to-right, but accents right-to-left), rearrangement and probably a few more. Details here (full UCA feature set): unicode.org/reports/tr10 and here (locale collation): chm.tu-dresden.de/edv/manuals/aix/files/aixfiles/LC_COLLATE.htmHighstepper
@Rafal: Wow, there seem to be a lot more special cases than I thought.Stipulation
@Rafał: It does seem to handle "æ" fine for both US and Swedish collation, though. But pyICO does more things, and is pretty cool. I also don't know how strcoll would work on Windows.Fanning
@Lennart Regebro: This is interesting. Do those collations stipulate that æ is sorted as "ae"? That's two characters, not just one character that fits between "a" and "b": ab<æ<af.Highstepper
@Rafał: Yes. In en_US it's 'ae', in sv_SE it's 'ä', which are both correct. strcoll does not just do character by character comparison, it's more complex than that. Look at what strxfrm() does in for example.Fanning
@Lennart: I thought that locale only handles contractions (not expansions) but I stand corrected.Highstepper
To clearly answer the question: Yes it is up to the task. There are apparently some special cases that the complete Unicode Collation Algorithm handles better, but unless you already knew that chances are you won't notice.Fanning
The biggest problem here is: you have to set the locale globally for the whole application. – You can’t just have it for the comparison at hand.Alternation
v=[u"áàAâÂbBcCçÇdDeéEf", u"asb", u"bab"] sorted(v, cmp=locale.strcoll) [u'asb', u'bab', u'\xe1\xe0A\xe2\xc2bBcC\xe7\xc7dDe\xe9Ef'] Doesn't workRoadbed
This does not work, really, also in some languages there are multiple standards of sorting. SV: ['Art', 'Ved', 'Wasa', 'Älg'] DE should be: ['Älg', 'Art', 'Ved', 'Wasa']. The SV-type ordering in DE is rare.Amyl
C
10

Try James Tauber's Python Unicode Collation Algorithm. It may not do exactly as you want, but seems well worth a look. For a bit more information about the issues, see this post by Christopher Lenz.

Cand answered 8/7, 2009 at 13:8 Comment(2)
That at least fixes the generic issue. I guess language sensitive versions of the collation list could be created too.Fanning
This doesn't let you specify locale, and the reference config file causes a ValueError.Teasel
A
10

You might also be interested in pyuca:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Though it is certainly not the most exact way, it is a very simple way to at least get it somewhat right. It also beats locale in a webapp as locale is not threadsafe and sets the language settings process-wide. It also easier to set up than PyICU which relies on an external C library.

I uploaded the script to github as the original was down at the time of this writing and I had to resort to web caches to get it:

https://github.com/href/Python-Unicode-Collation-Algorithm

I successfully used this script to sanely sort German/French/Italian text in a plone module.

Ammoniac answered 16/12, 2011 at 13:54 Comment(1)
+1 for pyuca. It's fairly fast (3 seconds to sort 28000 words), is pure python, and requires no dependency.Washboard
F
8

A summary and extended answer:

locale.strcoll under Python 2, and locale.strxfrm will in fact solve the problem, and does a good job, assuming that you have the locale in question installed. I tested it under Windows too, where the locale names confusingly are different, but on the other hand it seems to have all locales that are supported installed by default.

ICU doesn't necessarily do this better in practice, it however does way more. Most notably it has support for splitters that can split texts in different languages into words. This is very useful for languages that doesn't have word separators. You'll need to have a corpus of words to use as a base for the splitting, because that's not included, though.

It also has long names for the locales so you can get pretty display names for the locale, support for other calendars than Gregorian (although I'm not sure the Python interface supports that) and tons and tons of other more or less obscure locale supports.

So all in all: If you want to sort alphabetically and locale-dependent, you can use the locale module, unless you have special requirements, or also need more locale dependent functionality, like words splitter.

Fanning answered 16/2, 2011 at 7:0 Comment(0)
A
7

I see the answers have already done an excellent job, just wanted to point out one coding inefficiency in Human Sort. To apply a selective char-by-char translation to a unicode string s, it uses the code:

spec_dict = {'Å':'A', 'Ä':'A'}

def spec_order(s):
    return ''.join([spec_dict.get(ch, ch) for ch in s])

Python has a much better, faster and more concise way to perform this auxiliary task (on Unicode strings -- the analogous method for byte strings has a different and somewhat less helpful specification!-):

spec_dict = dict((ord(k), spec_dict[k]) for k in spec_dict)

def spec_order(s):
    return s.translate(spec_dict)

The dict you pass to the translate method has Unicode ordinals (not strings) as keys, which is why we need that rebuilding step from the original char-to-char spec_dict. (Values in the dict you pass to translate [as opposed to keys, which must be ordinals] can be Unicode ordinals, arbitrary Unicode strings, or None to remove the corresponding character as part of the translation, so it's easy to specify "ignore a certain character for sorting purposes", "map ä to ae for sorting purposes", and the like).

In Python 3, you can get the "rebuilding" step more simply, e.g.:

spec_dict = ''.maketrans(spec_dict)

See the docs for other ways you can use this maketrans static method in Python 3.

Actinomycosis answered 8/7, 2009 at 14:57 Comment(1)
This method is nice but does not allow you to place á between az and bAmyl
P
1

To implement it you will need to read about "Unicode collation algorithm" see http://en.wikipedia.org/wiki/Unicode_collation_algorithm

http://www.unicode.org/unicode/reports/tr10/

a sample implementation is here

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Pablopabon answered 8/7, 2009 at 13:13 Comment(0)
S
1

A Complete UCA Solution

The simplest, easiest, and most straightforward way to do this it to make a callout to the Perl library module, Unicode::Collate::Locale, which is a subclass of the standard Unicode::Collate module. All you need do is pass the constructor a locale value of "xv" for Sweden.

(You may not neccesarily appreciate this for Swedish text, but because Perl uses abstract characters, you can use any Unicode code point you please — no matter the platform or build! Few languages offer such convenience. I mention it because I’ve fighting a losing battle with Java a lot over this maddening problem lately.)

The problem is that I do not know how to access a Perl module from Python — apart, that is, from using a shell callout or two-sided pipe. To that end, I have therefore provided you with a complete working script called ucsort that you can call to do exactly what you have asked for with perfect ease.

This script is 100% compliant with the full Unicode Collation Algorithm, with all tailoring options supported!! And if you have an optional module installed or run Perl 5.13 or better, then you have full access to easy-to-use CLDR locales. See below.

Demonstration

Imagine an input set ordered this way:

b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q

A default sort by code point yields:

a b c d e f g h i j k l m n o p q r s t u v x y z ä å ö

which is incorrect by everybody’s book. Using my script, which uses the Unicode Collation Algorithm, you get this order:

% perl ucsort /tmp/swedish_alphabet | fmt
a å ä b c d e f g h i j k l m n o ö p q r s t u v x y z

That is the default UCA sort. To get the Swedish locale, call ucsort this way:

% perl ucsort --locale=sv /tmp/swedish_alphabet | fmt
a b c d e f g h i j k l m n o p q r s t u v x y z å ä ö

Here is a better input demo. First, the input set:

% fmt /tmp/swedish_set
cTD cDD Cöd Cbd cAD cCD cYD Cud cZD Cod cBD Cnd cQD cFD Ced Cfd cOD
cLD cXD Cid Cpd cID Cgd cVD cMD cÅD cGD Cqd Cäd cJD Cdd Ckd cÖD cÄD
Ctd Czd Cxd cHD cND cKD Cvd Chd Cyd cUD Cld Cmd cED Crd Cad Cåd Ccd
cRD cSD Csd Cjd cPD

By code point, that sorts this way:

Cad Cbd Ccd Cdd Ced Cfd Cgd Chd Cid Cjd Ckd Cld Cmd Cnd Cod Cpd Cqd
Crd Csd Ctd Cud Cvd Cxd Cyd Czd Cäd Cåd Cöd cAD cBD cCD cDD cED cFD
cGD cHD cID cJD cKD cLD cMD cND cOD cPD cQD cRD cSD cTD cUD cVD cXD
cYD cZD cÄD cÅD cÖD

But using the default UCA makes it sort this way:

% ucsort /tmp/swedish_set | fmt
cAD Cad cÅD Cåd cÄD Cäd cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD
Cgd cHD Chd cID Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod
cÖD Cöd cPD Cpd cQD Cqd cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD
Cxd cYD Cyd cZD Czd

But in the Swedish locale, this way:

% ucsort --locale=sv /tmp/swedish_set | fmt
cAD Cad cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD Cgd cHD Chd cID
Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod cPD Cpd cQD Cqd
cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD Cxd cYD Cyd cZD Czd cÅD
Cåd cÄD Cäd cÖD Cöd

If you prefer uppercase to sort before lowercase, do this:

% ucsort --upper-before-lower --locale=sv /tmp/swedish_set | fmt
Cad cAD Cbd cBD Ccd cCD Cdd cDD Ced cED Cfd cFD Cgd cGD Chd cHD Cid
cID Cjd cJD Ckd cKD Cld cLD Cmd cMD Cnd cND Cod cOD Cpd cPD Cqd cQD
Crd cRD Csd cSD Ctd cTD Cud cUD Cvd cVD Cxd cXD Cyd cYD Czd cZD Cåd
cÅD Cäd cÄD Cöd cÖD

Customized Sorts

You can do many other things with ucsort. For example, here is how to sort titles in English:

% ucsort --preprocess='s/^(an?|the)\s+//i' /tmp/titles
Anathem
The Book of Skulls
A Civil Campaign
The Claw of the Conciliator
The Demolished Man
Dune
An Early Dawn
The Faded Sun: Kesrith
The Fall of Hyperion
A Feast for Crows
Flowers for Algernon
The Forbidden Tower
Foundation and Empire
Foundation’s Edge
The Goblin Reservation
The High Crusade
Jack of Shadows
The Man in the High Castle
The Ringworld Engineers
The Robots of Dawn
A Storm of Swords
Stranger in a Strange Land
There Will Be Time
The White Dragon

You will need Perl 5.10.1 or better to run the script in general. For locale support, you must either install the optional CPAN module Unicode::Collate::Locale. Alternately, you can install a development versions of Perl, 5.13+, which include that module standardly.

Calling Conventions

This is a rapid prototype, so ucsort is mostly un(der)documented. But this is its SYNOPSIS of what switches/options it accepts on the command line:

    # standard options
    --help|?
    --man|m
    --debug|d

    # collator constructor options
    --backwards-levels=i
    --collation-level|level|l=i
    --katakana-before-hiragana
    --normalization|n=s
    --override-CJK=s
    --override-Hangul=s
    --preprocess|P=s
    --upper-before-lower|u
    --variable=s

    # program specific options
    --case-insensitive|insensitive|i
    --input-encoding|e=s
    --locale|L=s
    --paragraph|p
    --reverse-fields|last
    --reverse-output|r
    --right-to-left|reverse-input

Yeah, ok: that’s really the argument list I use for the call to Getopt::Long, but you get the idea. :)

If you can figure out how to call Perl library modules from Python directly without calling a Perl script, by all means do so. I just don’t know how myself. I’d love to learn how.

In the meantime, I believe this script will do what you need done in all its particular — and more! I now use this for all of text sorting. It finally does what I’ve needed for a long, long time.

The only downside is that --locale argument causes performance to go down the tubes, although it’s plenty fast enough for regular, non-locale but still 100% UCA compliant sorting. Since it loads everything in memory, you probably don’t want to use this on gigabyte documents. I use it many times a day, and it sure it great having sane text sorting at last.

Shantell answered 17/2, 2011 at 1:19 Comment(12)
Why on earth would you call a Perl script to do something there are Python libraries for?Fanning
Because I didn't know there was a Python library, that's why!Shantell
@Lennart: I really prefer native libraries, or at most ones linked to a C API and dynamically loaded (which you sometimes need). I haven’t found the various PyPerl and Inline::Perl solutions very convincing, or robust, or flexible. Or something. They just don’t feel right for some reasons. I last tried this when I needed good charset detection (which I never got, alas).Shantell
Yes, and the other answers here point out two such libraries, one which is included in Pythons standard library, and the other a wrapper around the C library ICU. For good character detection, there is chardet. pypi.python.org/pypi/chardet There is no need to have anything to do with Perl.Fanning
@Lennart: I missed that there was a version that worked with ICU. The locale one isn’t really the same thing, because it does not seem to use the UCA. It also seems very platform dependent. Am I mistaken? Understand that I missed that there was a working ICU linkup for Python, which is why I posted what I did. With ICU you get the UCA , just as one does with Perl. I just don’t think a non-UCA can substitute for the UCA.Shantell
@Lennart: The problem with the cited chardet is that it doesn’t identify MacRoman, which is a requirement. I was going to use that one until I discovered this failing.Shantell
It must be easier to add MacRoman to chardet than to call Perl. :) It's kinda like Latin1, right, but with different character mapping?Fanning
It’s only like Latin1 insofar as it is an 8-bit encoding. Very few code points overlap.Shantell
You are right it has loads of greek characters in it too. That might make it a bit more difficult, but still shouldn't not be too bad. (That the ordinal of a character like á is different is a trivial issue, that's just a matter of changing the map in the detector).Fanning
Using Perl inside Python is just addiction.Anjanette
@funktku: Ya think? :) But honestly, I’ve used Python from inside Perl before. I don’t understand why if there is a library to do something, and it does that thing well, it need be in written in the language calling that library. Don’t we call C code from C++ or Perl or Python? Doesn’t even Java have JNI for these sort of things? The important thing is the algorithm and behavior, not the programming language.Shantell
Wow. Yep - looks like Perl to me, in fact we see that there's now more than two ways to do things :) But calling C from Python doesn't generally imply the sorts of added dependencies and practical support issues that calling Perl would, so its awfully hard to see much call for doing it this way.Levitate
N
1

Lately I've been using zope.ucol (https://pypi.python.org/pypi/zope.ucol) for this task. For example, sorting the german ß:

>>> import zope.ucol
>>> collator = zope.ucol.Collator("de-de")
>>> mylist = [u"a", u'x', u'\u00DF']
>>> print mylist
[u'a', u'x', u'\xdf']
>>> print sorted(mylist, key=collator.key)
[u'a', u'\xdf', u'x']

zope.ucol also wraps ICU, so would be an alternative to PyICU.

Norton answered 10/5, 2013 at 11:8 Comment(0)
G
0

Jeff Atwood wrote a good post on Natural Sort Order, in it he linked to a script which does pretty much what you ask.

It's not a trivial script, by any means, but it does the trick.

Godly answered 8/7, 2009 at 13:11 Comment(0)
A
0

It is far from a complete solution for your use case, but you could take a look at the unaccent.py script from effbot.org. What it basically does is remove all accents from a text. You can use that 'sanitized' text to sort alphabetically. (For a better description see this page.)

Attis answered 8/7, 2009 at 15:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.