Is there a standard way to sort by a non-english alphabet? For example, the romanian alphabet is "a ă â b c..." [duplicate]
Asked Answered
T

2

5

Possible Duplicate:
How do I sort unicode strings alphabetically in Python?

As a citizen of the Rest-of-the-World, I'm really annoyed by the fact that computers aren't adapted by default to deal with international issues. Many sites still don't use Unicode and PHP is still in the Dark Ages.

When I want to sort a list of words or names in romanian I always have to write my own functions, which are hardly efficient. There must be some locale setting that makes sort functions obey the alphabet order of the specified language, right?

I'm mainly interested in Python, Java and JavaScript.

EDIT: I found my answer for Python here, as pointed out by Chris Morgan.

Traumatize answered 19/5, 2011 at 9:55 Comment(5)
Um, can't you get the byte values in UTF-8 and sort according to that? I mean, most sort functions obey the order of the comparator you give them, which you can define any way you like...Brother
No. In Unicode 'z' comes before 'ă'. That's the whole point.Traumatize
I realise that duplicate only deals with Python, but there is ICU for Java as well - no JavaScript version though.Gerdes
Yup, you're right Chris. My search skills weren't able to find that. I found my answer for Python: set the locale and then theList.sort(cmp = locale.strcoll).Traumatize
What is even worse, that the characters alone are sometimes not enough for sorting. For example, in Hungarian, the word "csiga" (snail) comes after "cukor" (sugar), and not before. Why? Because "cs" is considered to be one letter, even if it is expressed by two glyphs.Monolith
C
8

In Python, you can always use sorted function with a key parameter. For example, in Turkish, we have letters like 'ç','ı','ş' etc. If I want to sort according to that letter, I would use a key string which letters is sorted, and sort the string according to this, like this:

>>> letters="abcçdefgğhıijklmnoöprsştuüvyz" #Turkish alphabet
>>> sorted("açobzöğge")
['a', 'b', 'e', 'g', 'o', 'z', 'ç', 'ö', 'ğ'] #Python's default
>>> sorted("açobzöğge", key=lambda i: letters.index(i))
['a', 'b', 'ç', 'e', 'g', 'ğ', 'o', 'ö', 'z'] #With key parameter

Note: With Python 3; dealing with Unicode is easier.

Edit, as said by comments, this process would be more efficent if we use a dictionary:

>>> letters="abcçdefgğhıijklmnoöprsştuüvyz"
>>> d={i:letters.index(i) for i in letters}
>>> sorted("açobzöğge", key=d.get)
['a', 'b', 'ç', 'e', 'g', 'ğ', 'o', 'ö', 'z']
Crowns answered 19/5, 2011 at 10:11 Comment(4)
I think that's exactly what he said hardly efficient.Charmeuse
I don't think it's that inefficient. And I don't think there is a more efficent code for that(maybe some minor tweaks).Crowns
using index many times is less efficient than to prepare a dictionary mapping letter to an integer.Ebony
thanks, edited answer with dictionary.Crowns
Z
4

There is no single, unified sorting algorithm that's correct for all languages, because many languages have very specific sorting rules.

It goes even further than that: even within a single language, the sorting algorithm can vary depending on what it's used for (for example in German dictionaries are sorted slightly different from phone books).

The entire topic is called Collation. The Wikipedia article on Collating sequence is relevant as well.

There seems to be a project that implements correct collation for many languages called python-collate.

Zootechnics answered 19/5, 2011 at 10:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.