Is there a standard way to sort by a non-english alphabet? For example, the romanian alphabet is "a ă â b c..." [duplicate]

Asked 19/5, 2011 at 9:55 Answered 19/5, 2011 at 10:44

python sorting standards alphabetical non-english

Possible Duplicate:
How do I sort unicode strings alphabetically in Python?

As a citizen of the Rest-of-the-World, I'm really annoyed by the fact that computers aren't adapted by default to deal with international issues. Many sites still don't use Unicode and PHP is still in the Dark Ages.

When I want to sort a list of words or names in romanian I always have to write my own functions, which are hardly efficient. There must be some locale setting that makes sort functions obey the alphabet order of the specified language, right?

I'm mainly interested in Python, Java and JavaScript.

EDIT: I found my answer for Python here, as pointed out by Chris Morgan.

Traumatize answered 19/5, 2011 at 9:55 Comment(5)

Um, can't you get the byte values in UTF-8 and sort according to that? I mean, most sort functions obey the order of the comparator you give them, which you can define any way you like... – Brother 19/5, 2011 at 9:56

No. In Unicode 'z' comes before 'ă'. That's the whole point. – Traumatize 19/5, 2011 at 9:59

I realise that duplicate only deals with Python, but there is ICU for Java as well - no JavaScript version though. – Gerdes 19/5, 2011 at 10:14

Yup, you're right Chris. My search skills weren't able to find that. I found my answer for Python: set the locale and then theList.sort(cmp = locale.strcoll). – Traumatize 19/5, 2011 at 10:28

What is even worse, that the characters alone are sometimes not enough for sorting. For example, in Hungarian, the word "csiga" (snail) comes after "cukor" (sugar), and not before. Why? Because "cs" is considered to be one letter, even if it is expressed by two glyphs. – Monolith 25/6, 2011 at 16:52

In Python, you can always use sorted function with a key parameter. For example, in Turkish, we have letters like 'ç','ı','ş' etc. If I want to sort according to that letter, I would use a key string which letters is sorted, and sort the string according to this, like this:

>>> letters="abcçdefgğhıijklmnoöprsştuüvyz" #Turkish alphabet
>>> sorted("açobzöğge")
['a', 'b', 'e', 'g', 'o', 'z', 'ç', 'ö', 'ğ'] #Python's default
>>> sorted("açobzöğge", key=lambda i: letters.index(i))
['a', 'b', 'ç', 'e', 'g', 'ğ', 'o', 'ö', 'z'] #With key parameter

Note: With Python 3; dealing with Unicode is easier.

Edit, as said by comments, this process would be more efficent if we use a dictionary:

>>> letters="abcçdefgğhıijklmnoöprsştuüvyz"
>>> d={i:letters.index(i) for i in letters}
>>> sorted("açobzöğge", key=d.get)
['a', 'b', 'ç', 'e', 'g', 'ğ', 'o', 'ö', 'z']

Crowns answered 19/5, 2011 at 10:11 Comment(4)

I think that's exactly what he said hardly efficient. – Charmeuse 19/5, 2011 at 10:13

I don't think it's that inefficient. And I don't think there is a more efficent code for that(maybe some minor tweaks). – Crowns 19/5, 2011 at 10:21

using index many times is less efficient than to prepare a dictionary mapping letter to an integer. – Ebony 19/5, 2011 at 10:49

thanks, edited answer with dictionary. – Crowns 19/5, 2011 at 11:8

There is no single, unified sorting algorithm that's correct for all languages, because many languages have very specific sorting rules.

It goes even further than that: even within a single language, the sorting algorithm can vary depending on what it's used for (for example in German dictionaries are sorted slightly different from phone books).

The entire topic is called Collation. The Wikipedia article on Collating sequence is relevant as well.

There seems to be a project that implements correct collation for many languages called python-collate.

Zootechnics answered 19/5, 2011 at 10:44 Comment(0)

Recommended topics

Hot tags