A Complete UCA Solution
The simplest and most straightforward way to do this is to make a callout to the Perl library module Unicode::Collate::Locale, which is a subclass of the standard Unicode::Collate module. All you need to do is pass the constructor a locale value of "sv" for Sweden.
(You may not necessarily appreciate this for Swedish text, but because Perl uses abstract characters, you can use any Unicode code point you please, no matter the platform or build! Few languages offer such convenience. I mention it because I’ve been fighting a losing battle with Java over this maddening problem lately.)
The problem is that I do not know how to access a Perl module from Python, apart from using a shell callout or a two-sided pipe. To that end, I have provided you with a complete working script called ucsort that you can call to do exactly what you have asked for with perfect ease.
This script is 100% compliant with the full Unicode Collation Algorithm, with all tailoring options supported!! And if you have an optional module installed or run Perl 5.13 or better, then you have full access to easy-to-use CLDR locales. See below.
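For the shell-callout route from Python, a minimal sketch might look like the following. It assumes that perl is on the PATH, that the script is saved as ucsort in the current directory, and that ucsort reads standard input as UTF-8 when no file argument is given (the examples in this answer only show it called with a file name, so check that first). The helper names build_ucsort_command and sort_lines are made up for illustration; the --locale and --upper-before-lower switches are ones ucsort accepts.

```python
import subprocess


def build_ucsort_command(locale=None, upper_before_lower=False,
                         script="ucsort", perl="perl"):
    """Build the argv list for a ucsort call.

    The script and perl paths are assumptions; adjust them to your setup.
    """
    command = [perl, script]
    if locale is not None:
        command.append(f"--locale={locale}")
    if upper_before_lower:
        command.append("--upper-before-lower")
    return command


def sort_lines(lines, command):
    """Pipe lines through an external sorter and return its output lines.

    Assumes the command reads UTF-8 text on stdin and writes UTF-8 to stdout.
    """
    payload = "".join(line + "\n" for line in lines)
    result = subprocess.run(
        command,
        input=payload.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the sorter fails
    )
    return result.stdout.decode("utf-8").splitlines()


# Hypothetical usage, once perl and ucsort are in place:
#   words = ["ö", "z", "å", "a", "ä"]
#   print(sort_lines(words, build_ucsort_command(locale="sv")))
```

Passing the command as a list keeps the shell out of the picture, so nothing in the input needs quoting. If ucsort turns out to need a file name, write the lines to a temporary file and append its path to the command instead.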
Demonstration
Imagine an input set ordered this way:
b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q
A default sort by code point yields:
a b c d e f g h i j k l m n o p q r s t u v x y z ä å ö
which is incorrect by everybody’s book. Using my script, which uses the Unicode Collation Algorithm, you get this order:
% perl ucsort /tmp/swedish_alphabet | fmt
a å ä b c d e f g h i j k l m n o ö p q r s t u v x y z
That is the default UCA sort. To get the Swedish locale, call ucsort this way:
% perl ucsort --locale=sv /tmp/swedish_alphabet | fmt
a b c d e f g h i j k l m n o p q r s t u v x y z å ä ö
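To see from the Python side why the plain code point sort goes wrong, here is a small sketch. Python's built-in sorted compares strings by code point, which reproduces the incorrect order shown earlier. The second sort uses a hand-rolled alphabet table (the names SWEDISH_ALPHABET and swedish_letter_key are made up here) to mimic what the Swedish tailoring does for single lowercase letters. It is a toy stand-in, not the Unicode Collation Algorithm, and it knows nothing about case levels or other scripts.

```python
import unicodedata

# Assumed Swedish alphabet order, with å ä ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: index for index, letter in enumerate(SWEDISH_ALPHABET)}


def swedish_letter_key(word):
    """Toy key: rank each letter by its place in the Swedish alphabet."""
    word = unicodedata.normalize("NFC", word).lower()
    # Characters outside the table sort after it, ordered by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]


letters = "b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q".split()
# Normalize so each accented letter is a single precomposed code point.
letters = [unicodedata.normalize("NFC", s) for s in letters]

# Code point order: ä (U+00E4) sorts before å (U+00E5), both after z.
print(" ".join(sorted(letters)))

# Alphabet-table order: å, ä, ö follow z in that sequence.
print(" ".join(sorted(letters, key=swedish_letter_key)))
```

A table like this breaks down as soon as case, accents on other letters, or non-Swedish text enter the picture, which is exactly what the full UCA with a locale handles.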
Here is a better input demo. First, the input set:
% fmt /tmp/swedish_set
cTD cDD Cöd Cbd cAD cCD cYD Cud cZD Cod cBD Cnd cQD cFD Ced Cfd cOD
cLD cXD Cid Cpd cID Cgd cVD cMD cÅD cGD Cqd Cäd cJD Cdd Ckd cÖD cÄD
Ctd Czd Cxd cHD cND cKD Cvd Chd Cyd cUD Cld Cmd cED Crd Cad Cåd Ccd
cRD cSD Csd Cjd cPD
By code point, that sorts this way:
Cad Cbd Ccd Cdd Ced Cfd Cgd Chd Cid Cjd Ckd Cld Cmd Cnd Cod Cpd Cqd
Crd Csd Ctd Cud Cvd Cxd Cyd Czd Cäd Cåd Cöd cAD cBD cCD cDD cED cFD
cGD cHD cID cJD cKD cLD cMD cND cOD cPD cQD cRD cSD cTD cUD cVD cXD
cYD cZD cÄD cÅD cÖD
But using the default UCA makes it sort this way:
% ucsort /tmp/swedish_set | fmt
cAD Cad cÅD Cåd cÄD Cäd cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD
Cgd cHD Chd cID Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod
cÖD Cöd cPD Cpd cQD Cqd cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD
Cxd cYD Cyd cZD Czd
But in the Swedish locale, this way:
% ucsort --locale=sv /tmp/swedish_set | fmt
cAD Cad cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD Cgd cHD Chd cID
Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod cPD Cpd cQD Cqd
cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD Cxd cYD Cyd cZD Czd cÅD
Cåd cÄD Cäd cÖD Cöd
If you prefer uppercase to sort before lowercase, do this:
% ucsort --upper-before-lower --locale=sv /tmp/swedish_set | fmt
Cad cAD Cbd cBD Ccd cCD Cdd cDD Ced cED Cfd cFD Cgd cGD Chd cHD Cid
cID Cjd cJD Ckd cKD Cld cLD Cmd cMD Cnd cND Cod cOD Cpd cPD Cqd cQD
Crd cRD Csd cSD Ctd cTD Cud cUD Cvd cVD Cxd cXD Cyd cYD Czd cZD Cåd
cÅD Cäd cÄD Cöd cÖD
Customized Sorts
You can do many other things with ucsort. For example, here is how to sort titles in English:
% ucsort --preprocess='s/^(an?|the)\s+//i' /tmp/titles
Anathem
The Book of Skulls
A Civil Campaign
The Claw of the Conciliator
The Demolished Man
Dune
An Early Dawn
The Faded Sun: Kesrith
The Fall of Hyperion
A Feast for Crows
Flowers for Algernon
The Forbidden Tower
Foundation and Empire
Foundation’s Edge
The Goblin Reservation
The High Crusade
Jack of Shadows
The Man in the High Castle
The Ringworld Engineers
The Robots of Dawn
A Storm of Swords
Stranger in a Strange Land
There Will Be Time
The White Dragon
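The idea behind that --preprocess switch, stripping a leading article before comparing while still printing the original line, can be sketched in Python with a sort key. This is only an approximation under stated assumptions: it compares case-folded strings by code point rather than with the UCA, so it matches the ucsort result only for plain English titles like these. The names LEADING_ARTICLE and title_key are invented for the example.

```python
import re

# Same pattern as the --preprocess example: a leading "a", "an", or "the".
LEADING_ARTICLE = re.compile(r"^(an?|the)\s+", re.IGNORECASE)


def title_key(title):
    """Sort key that ignores a leading English article and letter case."""
    return LEADING_ARTICLE.sub("", title).casefold()


titles = [
    "The White Dragon",
    "Dune",
    "A Civil Campaign",
    "Anathem",
    "An Early Dawn",
    "There Will Be Time",
    "The Book of Skulls",
]

# The key is used only for comparison; the original titles are printed.
for title in sorted(titles, key=title_key):
    print(title)
```

Note that "Anathem" and "There Will Be Time" keep their full text in the key, because the pattern requires whitespace after the article.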
You will need Perl 5.10.1 or better to run the script in general. For locale support, you must either install the optional CPAN module Unicode::Collate::Locale or install a development version of Perl, 5.13+, which includes that module as standard.
Calling Conventions
This is a rapid prototype, so ucsort is mostly un(der)documented. But here is its SYNOPSIS of the switches and options it accepts on the command line:
# standard options
--help|?
--man|m
--debug|d
# collator constructor options
--backwards-levels=i
--collation-level|level|l=i
--katakana-before-hiragana
--normalization|n=s
--override-CJK=s
--override-Hangul=s
--preprocess|P=s
--upper-before-lower|u
--variable=s
# program specific options
--case-insensitive|insensitive|i
--input-encoding|e=s
--locale|L=s
--paragraph|p
--reverse-fields|last
--reverse-output|r
--right-to-left|reverse-input
Yeah, ok: that’s really the argument list I use for the call to Getopt::Long, but you get the idea. :)
If you can figure out how to call Perl library modules from Python directly without calling a Perl script, by all means do so. I just don’t know how myself. I’d love to learn how.
In the meantime, I believe this script will do what you need done in all its particulars, and more! I now use this for all of my text sorting. It finally does what I’ve needed for a long, long time.
The only downside is that the --locale argument causes performance to go down the tubes, although it’s plenty fast enough for regular, non-locale but still 100% UCA compliant sorting. Since it loads everything in memory, you probably don’t want to use this on gigabyte documents. I use it many times a day, and it sure is great having sane text sorting at last.
locale.strcoll answer is correct when you need Unicode sorting using the user's locale, and the ICU answer what you want when you need more than that (collation using more than one locale). Most of the time, you want locale.strcoll. – Rosado
locale.strcoll works and especially what ICU does better than the Python function. Basically some more attention for the question. – Stipulation
--locale=de__phonebook when you need it. The Perl module passes the UCA test suite, and the script I provided makes it a lot easier to play with the whole UCA plus all its options including locales, just from the command line. Might not answer the question, but it should still be highly interesting. If you’re in Switzerland, I am sure you could use the flexibility. :) – Shantell