Wrong sorting with Collator using Locale.SIMPLIFIED_CHINESE

I'm trying to sort a list of country names in Chinese using Locale.SIMPLIFIED_CHINESE, which seems to order by pinyin (the phonetic alphabet; that is, characters are ordered by their Latin transcription, from A to Z).

But I've found some cases where the ordering is wrong. For example:

  • '中' character is zhong1
  • '梵' character is fan4

The correct order should be 梵 < 中, but they are sorted the other way round.

String[] characters = new String[] {"梵", "中"};
List<String> list = Arrays.asList(characters);
System.out.println("Before sorting...");
System.out.println(list.toString());

Collator collator = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);
collator.setStrength(Collator.PRIMARY);
Collections.sort(list, collator);

System.out.println("After sorting...");
System.out.println(list.toString());

Results of this snippet are:

Before sorting...
[梵, 中]
After sorting...
[中, 梵]

Digging deeper, I found the rules that Java applies for Locale.SIMPLIFIED_CHINESE. You can see them in this image: https://postimg.cc/image/4t915a7gp/full/ (notice that 梵 comes after 中).

I realized that before the <口<口<口<口<口 marker, which I highlighted in red, all characters are ordered by their Latin transcription, from A to Z. After the marker, however, the characters are ordered by their composition. For example, characters that share a component (usually the left part of the character) are grouped together instead of following the A to Z rule.

Also, all the characters after <口<口<口<口<口 are less common Chinese characters. 梵 is less common than 中, so it is placed after <口<口<口<口<口.
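For reference, the rule string described above can be dumped with the JDK alone. This is only a minimal sketch: the class and method names (`CollationRulesDump`, `rulesFor`) are made up for illustration, and the content of the rules depends on the Java runtime in use, so the printed positions may differ from the image (an index of -1 means the character is not mentioned in the rules at all).

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

class CollationRulesDump {
    /** Returns the rule string of the JDK collator for the locale, or null if it is not rule based. */
    static String rulesFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        if (collator instanceof RuleBasedCollator) {
            return ((RuleBasedCollator) collator).getRules();
        }
        return null;
    }

    public static void main(String[] args) {
        String rules = rulesFor(Locale.SIMPLIFIED_CHINESE);
        if (rules == null) {
            System.out.println("Collator is not rule based");
            return;
        }
        System.out.println("Rule string length: " + rules.length());
        // A smaller index means the character appears earlier in the rules.
        System.out.println("Position of \u4E2D: " + rules.indexOf('\u4E2D'));
        System.out.println("Position of \u68B5: " + rules.indexOf('\u68B5'));
    }
}
```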

I wonder why this decision was made, and whether it is intentional, because it results in wrong sorting. I don't know how to find a solution for this.

Bevan answered 12/11, 2015 at 13:22 Comment(2)
Have you tried using icu4j? Padron
I've tried pinyin4j, and its order is good. I haven't tried icu4j yet. But my question is about why Oracle sorts with those rules; maybe it is a bug to report, or maybe there is another way to sort by pinyin conventions using the Java API. In my company it is difficult to add new libraries for reliability reasons. Thanks for your support! Bevan

The sort order provided by the collator in Java is based on the number of strokes needed to write each character.

The small snippet below, which uses Locale.TRADITIONAL_CHINESE, demonstrates this. The stroke counts are taken from Wiktionary.

// the unicode character and the number of strokes
String[] characters = new String[]{
    "\u68B5 (11)", "\u4E2D (4)", 
    "\u5207 (4)", "\u5973 (3)", "\u898B (7)"
};
List<String> list = Arrays.asList(characters);
System.out.println("Before sorting...");
System.out.println(list.toString());

Collator collator = Collator.getInstance(Locale.TRADITIONAL_CHINESE);
collator.setStrength(Collator.PRIMARY);
System.out.println();
Collections.sort(list, collator);

System.out.println("After sorting...");
System.out.println(list.toString());

output

Before sorting...
[梵 (11), 中 (4), 切 (4), 女 (3), 見 (7)]

After sorting...
[女 (3), 中 (4), 切 (4), 見 (7), 梵 (11)]

There is an enhancement request, JDK-6415666, to implement the sort order according to the Unicode collation order. But according to the information about the locales supported in Java 8, it is not implemented in Java 8.
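If adding a library is not an option, one JDK-only workaround for a small, known set of characters might be to hand-write the order in a java.text.RuleBasedCollator. The sketch below is not a pinyin collator: the rule string simply lists the two characters from the question in the order their readings imply (fan4 before zhong1), and the class and method names are hypothetical. For a real country list, the rules would have to name every character that matters.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExplicitOrderSort {
    // Hand-written rules (an assumption for this example): 梵 (fan4) sorts before 中 (zhong1).
    static final String RULES = "< \u68B5 < \u4E2D";

    /** Returns a sorted copy of the input using the explicit rules above. */
    static List<String> sortWithRules(List<String> input) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            List<String> sorted = new ArrayList<>(input);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            // Only reached if the rule string itself is malformed.
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        List<String> result = sortWithRules(Arrays.asList("\u4E2D", "\u68B5"));
        System.out.println(result); // prints [梵, 中]
    }
}
```

Characters that are not named in the rules are still comparable, but they fall back to the collator's default ordering, so this only helps for the characters listed explicitly.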

Edit: The sort order using the collator from ICU4J is:

[梵 (11), 見 (7), 女 (3), 切 (4), 中 (4)]

ICU4J code snippet

import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
...
Locale locale = new Locale("zh", "", "PINYIN");
Collator collator = (RuleBasedCollator) Collator.getInstance(locale);
Loriloria answered 15/1, 2016 at 12:49 Comment(0)
