Case insensitive comparisons across locales in Java

Consider the following Java code, which compares a short string containing the German letter ß with its uppercase form:

String a = "ß";
String b = a.toUpperCase();          // "SS", not "ẞ"

assertTrue(a.equalsIgnoreCase(b));   // fails: "ß" has length 1, "SS" has length 2

The assertion fails because "ß".toUpperCase() returns "SS", and equalsIgnoreCase() then rejects the pair. The Javadoc for toUpperCase() mentions this case explicitly, but I don't understand why the result is not ẞ, the capital variant of ß.
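To make the asymmetry concrete, here is a minimal sketch of the two mappings involved. It assumes the JDK follows Unicode's default case mappings, where ß (U+00DF) uppercases to the two-letter sequence "SS" while ẞ (U+1E9E) lowercases to ß. The class and method names are made up for illustration, and Locale.ROOT is used only to keep the result independent of the machine's default locale.

```java
import java.util.Locale;

// Hypothetical demo class; not part of the original question.
final class SharpSCaseMapping {

    // String-level uppercase mapping; the result may be longer than the input.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // String-level lowercase mapping.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String small = "\u00DF";   // ß, LATIN SMALL LETTER SHARP S
        String capital = "\u1E9E"; // ẞ, LATIN CAPITAL LETTER SHARP S

        System.out.println(upper(small));                 // uppercase form of ß
        System.out.println(lower(capital));               // lowercase form of ẞ
        System.out.println(upper(small).equals(capital)); // is the round trip symmetric?
    }
}
```

So the mapping is one-way: ẞ goes down to ß, but ß does not go up to ẞ.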

More generally, how should we do case-insensitive comparisons, potentially across different locales? Should we always use either toUpperCase() or equalsIgnoreCase(), but never both?

The problem seems to be that the implementation of equalsIgnoreCase() includes the check anotherString.value.length == value.length, which is incompatible with the Javadoc for toUpperCase(), which states:

Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.
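The length mismatch can be shown directly. The sketch below also tries one possible reading of "never both": apply the same mapping to both sides and compare the results with equals(). The helper name equalsAfterUpperCasing is hypothetical, and this is only an illustration of the idea, not a locale-aware solution.

```java
import java.util.Locale;

// Hypothetical demo class; names are illustrative only.
final class SymmetricCaseCompare {

    // Uppercases BOTH operands with the same mapping, then compares exactly.
    static boolean equalsAfterUpperCasing(String x, String y) {
        return x.toUpperCase(Locale.ROOT).equals(y.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String a = "\u00DF";                   // ß
        String b = a.toUpperCase(Locale.ROOT); // "SS"

        // The two strings differ in length, so equalsIgnoreCase() rejects them.
        System.out.println(a.length() + " vs " + b.length());
        System.out.println(a.equalsIgnoreCase(b));

        // Mapping both sides the same way avoids mixing the two mechanisms.
        System.out.println(equalsAfterUpperCasing(a, b));
    }
}
```

Because both operands go through the same mapping, a length change on one side is matched on the other. It still ignores locale-specific rules, which is where Collator comes in.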

Quits answered 15/5, 2017 at 21:9 Comment(4)
You would need to use a Collator instead of the built-in methods of String. – Garnes
SS is the uppercase form because Unicode defines it that way. – Denicedenie
@AndyTurner that's weird, because there is a Unicode code point for the uppercase character, and it defines this character as its lowercase character: fileformat.info/info/unicode/char/1e9e/index.htm – Quits
@AndyTurner Is this just to do with the fact that the capital was introduced in 2008, and the original in 1993? – Quits

Java's Collator class is designed for locale-sensitive text comparison. Since the concept of "upper case" varies quite a bit between locales, Collator uses a more fine-grained model called comparison strength. Four strength levels are provided (PRIMARY, SECONDARY, TERTIARY, and IDENTICAL), and how they affect comparisons is locale-dependent.

Here's an example of using Collator with the German locale for case-insensitive comparison of the letter ß:

import java.text.Collator;
import java.util.Locale;

Collator germanCollator = Collator.getInstance(Locale.GERMAN);
// Strengths from least to most strict (constant values 0 to 3).
int[] strengths = new int[] {Collator.PRIMARY, Collator.SECONDARY,
                             Collator.TERTIARY, Collator.IDENTICAL};

String a = "ß";
String b = "ß".toUpperCase();   // "SS"

// Report every strength at which the two strings compare as equal.
for (int strength : strengths) {
    germanCollator.setStrength(strength);
    if (germanCollator.compare(a, b) == 0) {
        System.out.println(String.format(
                "%s and %s are equivalent when comparing differences with "
                + "strength %s using the GERMAN locale.",
                a, b, String.valueOf(strength)));
    }
}

The code prints out

ß and SS are equivalent when comparing differences with strength 0 using the GERMAN locale.
ß and SS are equivalent when comparing differences with strength 1 using the GERMAN locale.

which means that the German locale considers these two strings equal in PRIMARY (0) and SECONDARY (1) strength comparisons, but not at TERTIARY or IDENTICAL strength.
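Building on that result, a case-insensitive equality check could be wrapped in a small helper. This is only a sketch: the class and method names are hypothetical, and choosing SECONDARY strength is an assumption based on the output above (what each strength ignores is locale-dependent).

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper; not part of the JDK or of the original answer.
final class CollatorCaseInsensitive {

    // Compares two strings with the given locale's collation rules at
    // SECONDARY strength, which is assumed here to ignore case differences.
    static boolean equalsIgnoringCase(String x, String y, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(x, y) == 0;
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoringCase("\u00DF", "SS", Locale.GERMAN));
        System.out.println(equalsIgnoringCase("strasse", "STRASSE", Locale.GERMAN));
        System.out.println(equalsIgnoringCase("a", "b", Locale.GERMAN));
    }
}
```

A new Collator is created on each call to keep the sketch simple; if the check runs often, it may be worth obtaining one instance per locale and reusing it.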

Garnes answered 15/5, 2017 at 21:39 Comment(0)
