How to sort non-English strings?

I did look up answers, and they work well for the standard alphabet, but my situation is different.

I am programming in Java. At one point my program has a list of strings, and I would like to sort those strings alphabetically.

If I were sorting by the English alphabet, it would be easy: practically all code pages are compatible with the American Standard Code for Information Interchange (ASCII), which already has the letters of the English alphabet in sorted order. To sort my list, I would only have to compare the char values to determine which letter goes where.

My problem is that I do not want to sort the list by the English alphabet. My program can display in English or in some other languages, and some of those languages have an alphabet that differs from the English one. Their letters are not in the right order in the code page, so a simple < and > comparison of char values does not work.

For the purposes of this question, let's say the English alphabet is as follows:

a,
b,
c,
d,
e,
f,
g.

Let's say there is a certain country named "ABC" whose alphabet goes like this:

d,
b,
g,
e,
a,
c,
f.

First of all, if a is equal to 97 in the code page, b to 98, c to 99 et cetera, how can I sort my list using the second alphabet in this example, given that its first letter is equal to 100, its second to 98, its third to 103 et cetera?

And my second question: unfortunately, some of the countries I am translating my program for have alphabets where certain combinations of letters are treated as one letter. For my second example, let's say that country "def" has the following alphabet:

d,
g,
be,
e,
fe,
c,
f.

Here: d is the first letter in the alphabet, g is the second, be is the third (ONE letter: although it is written as two characters, it is considered a single letter with its own position in the alphabet), e is the fourth, fe is the fifth (also written as two characters but treated as ONE letter), c is the sixth, and f is the seventh.

As you can see in this second example, the imaginary country "def" has a really screwed up alphabet. After these two examples of two imaginary alphabets, you can understand why I cannot use the standard method for sorting strings.

So, can you please help me with this sorting? I am not sure how to sort according to such a screwed up alphabet.

Postscript: the lines below are not important for the question; they are just more information in case anyone wants to know where I found such a screwed up alphabet.

I gave those examples, which consist of 7 randomly ordered letters, just to keep this question simple. In case you wonder what my real problem is: I am trying to translate my program into Croatian. The Croatian alphabet is really screwed up, because it goes as follows:

1 |a
2 |b
3 |c
4 |č
5 |ć
6 |d
7 |đ
8 |đž
9 |e
10|f
11|g
12|h
13|i
14|j
15|k
16|l
17|lj
18|m
19|n
20|nj
21|o
22|p
23|r
24|s
25|š
26|t
27|u
28|v
29|z
30|ž

As you can see, the Croatian alphabet is somewhat similar to the English alphabet, but most of the letters are not at the same position as the English ones, several of them do not exist in the English alphabet at all, and several are single letters written as two characters. So it is really difficult to sort, and I hope someone knows a method for doing it.

Of course, there is the dumbest method, which always works and can sort anything: a switch statement. I compare two strings, and for each letter I use a switch statement with 31 + default = 32 cases, each of which has its own switch with 32 cases. That is 1024 cases in total, and if an average case takes 4 lines of code, my sort method would be at least 4096 lines long. That is a huge method. It is the dumbest way of sorting, but the only one I can figure out at the moment. So I am asking here in the hope that someone knows a simpler method, one that does not need 4k lines of code just to sort strings. My method for sorting English strings takes only a bit more than 10 lines of code.

So if anyone knows a simpler solution, I would appreciate it.

Thanks.

Persson answered 19/12, 2016 at 4:42 Comment(7)
Are "đž", "lj" and "nj" not single letters in Croatian? They are in this Wikipedia article: https://en.wikipedia.org/wiki/Gaj%27s_Latin_alphabet. Perhaps you'll need to handle both variants, but I think you'll find ǆ, ǉ, and ǌ work better with Locale than the non-digraph (incorrect?) versions. – Xavierxaviera
Hey, yes, that is correct. This is what makes it more difficult to sort: for example, "l" is one letter and "j" is another, but when they stand next to each other they are considered a third letter. I mentioned this in my question. Anyway, the answer below works nicely and sorts the letters correctly, so I do not have to worry about this anymore. – Persson
Good. My point is that there are single characters for the digraphs. Try copying one from the last sentence in my previous comment. – Xavierxaviera
Oh, really? I had not noticed. Thank you. I did not know that those characters have their own codes; I thought they were always written as two characters, lj or nj. Cool, now I have learned something new. But I do not think I will apply it. My application is a console application (although it has some sort of GUI), and I use one code page in my program. This code page has all the Croatian letters, except that letters like lj, nj and đž do not exist in it as single characters. I went through the whole code page from char 1 to 32768 and did not find them there. – Persson
So, since this code page has l, j, n, đ and ž as separate letters, I will include support just for those. Besides, it is not necessary to handle both đ ž as two letters and đž as one letter, because even the Croatian keyboard layout does not have those characters as single keys. When users type, for example, "đž", they have to type đ and then ž, because there is no key with "đž" on it. So ordinary users are unlikely to use the single-character variant you suggested. – Persson
Croats read them as single letters but are used to writing them separately. Since the keyboard does not have those keys, users would have to copy and paste those characters if they wanted to use them. Even if someone does, it is not important, since most users will not: they will type l and j separately, n and j, and đ and ž. Plus, my project is not intended for a lot of users. It is just a simple server application for my college. Probably no one would use the single characters, and if someone does, it is okay if it ends up not working. – Persson
But anyway, thank you for mentioning this to me. I will remember it in case I need it in the future. – Persson

You use a Collator for that. Collators are Java's way to handle internationalized comparisons.

import java.text.Collator;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

List<String> mylist = ...;
Locale croatian = new Locale("hr", "HR");
// Put whatever Locale you need as the argument to the getInstance method.
Collator collator = Collator.getInstance(croatian);
Collections.sort(mylist, collator);

A Locale is not just a "language" but also many other conventions. The same language can be sorted differently depending on the country, region, or convention within the country. That is why a Locale is identified by up to 3 parts: "language", "country" (region) and "variant".
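To make the difference from plain char comparison concrete, here is a minimal sketch that sorts the same words once with a Croatian Collator and once by char value. The class and method names (CroatianSortSketch, sortCroatian, sortByCharValue) and the sample words are made up for illustration, and the letter č is written as the escape \u010D so the source compiles regardless of file encoding.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class CroatianSortSketch {

    // Returns a sorted copy using the collation rules for Croatian (hr-HR).
    static List<String> sortCroatian(List<String> words) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("hr-HR"));
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator is a Comparator, so it plugs into sort directly
        return sorted;
    }

    // Returns a sorted copy using String.compareTo, i.e. plain char values.
    static List<String> sortByCharValue(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // "\u010Dvor" is "čvor"; č (U+010D) has a larger char value than d.
        List<String> words = Arrays.asList("dan", "\u010Dvor", "car");
        System.out.println(sortCroatian(words));    // č is placed between c and d
        System.out.println(sortByCharValue(words)); // č is placed after d
    }
}
```

Whether digraphs such as lj and nj are treated as one letter depends on the collation rules shipped with the Java runtime, so it is worth checking that case against your own JDK.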

Danube answered 19/12, 2016 at 4:50 Comment(3)
@Andreas nice one – Danube
Hello Erwin. Thank you a lot for your example, and thanks to Andreas for providing a sample class. I had never heard of "Collator" before. I took a Java course, but we never talked about this sort of thing. We learned only "bubble sort", and after that we were told that Java has methods in several classes which can be used to sort things. We spent about 10 minutes in total on sorting in that course and never considered things like sorting with non-English alphabets. – Persson
Although I took the Java course in Croatian, the teacher avoided using Croatian letters, and all the tasks we did were either in English or in Croatian with all Croatian letters replaced by their closest English variants. So I never needed to sort non-standard letters such as šđćžč until now. Anyway, thank you for your answer, it really helped. You just saved me 2 days of programming, because the only method I could imagine would require at least 4096 lines of code. – Persson

The concept is called collation. You can look up the concept to know more about it. For example, Oracle/Sun has a tutorial about this concept:

https://docs.oracle.com/javase/tutorial/i18n/text/rule.html
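As a rough sketch of what custom collation rules could look like for the two imaginary alphabets from the question, the example below uses java.text.RuleBasedCollator. The class name CustomAlphabetSketch, the helper sortWithRules and the sample words are made up for illustration. One assumption is labeled in the code: the "def" alphabet does not say where a lone b belongs, so this sketch arbitrarily places it last.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class CustomAlphabetSketch {

    // Imaginary country "ABC": d, b, g, e, a, c, f.
    static final String ABC_RULES = "< d < b < g < e < a < c < f";

    // Imaginary country "def": d, g, be, e, fe, c, f.
    // "be" and "fe" are listed as units, so each sorts as one letter.
    // ASSUMPTION: a lone 'b' is not part of that alphabet; it is put last here.
    static final String DEF_RULES = "< d < g < be < e < fe < c < f < b";

    // Returns a sorted copy of the words, ordered by the given collation rules.
    static List<String> sortWithRules(String rules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Each word starts with a different letter of the respective alphabet.
        System.out.println(sortWithRules(ABC_RULES,
                Arrays.asList("fad", "cab", "ace", "egg", "gab", "bad", "dab")));
        System.out.println(sortWithRules(DEF_RULES,
                Arrays.asList("fgd", "ceg", "fed", "egg", "bed", "geg", "deg")));
    }
}
```

Characters that do not appear in the rules at all are sorted after the ones that do, so a real alphabet would need every letter (and its uppercase form) listed.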

Rathe answered 19/12, 2016 at 4:47 Comment(1)
FYI, answers consisting only of a link are discouraged on Stack Overflow (because the link could break and then your answer would be useless). This would have been better as a comment. – Underworld
