Case-insensitive comparisons across locales in Java

Consider the following Java code, which compares a short string containing the German letter ß:

String a = "ß";
String b = a.toUpperCase();

assertTrue(a.equalsIgnoreCase(b));

The comparison fails because "ß".toUpperCase() returns "SS", which then fails a check in equalsIgnoreCase(). The Javadocs for toUpperCase() mention this case explicitly, but I don't understand why the result is not ẞ, the capital variant of ß.

More generally, how should we do case-insensitive comparisons, potentially across different locales? Should we always use either toUpperCase() or equalsIgnoreCase(), but never both?

The problem seems to be that the implementation of equalsIgnoreCase() includes the check anotherString.value.length == value.length, which is incompatible with the Javadocs for toUpperCase(), which state:

Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.
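The mismatch described above can be reproduced in isolation. The following is a minimal sketch; the class name SharpSUpperCaseDemo and the helper upperCaseChangesLength are made up for illustration, and Locale.ROOT is passed explicitly so the result does not depend on the machine's default locale:

```java
import java.util.Locale;

public class SharpSUpperCaseDemo {

    // Hypothetical helper: true when upper-casing changes the number of chars.
    static boolean upperCaseChangesLength(String s) {
        return s.toUpperCase(Locale.ROOT).length() != s.length();
    }

    public static void main(String[] args) {
        String a = "\u00DF";                 // the letter ß
        String b = a.toUpperCase(Locale.ROOT);

        System.out.println(b);               // SS
        System.out.println(a.length());      // 1
        System.out.println(b.length());      // 2

        // equalsIgnoreCase() requires equal lengths, so a 1-char string
        // can never match a 2-char string, whatever the characters are.
        System.out.println(a.equalsIgnoreCase(b));          // false
        System.out.println(upperCaseChangesLength(a));      // true
    }
}
```

So mixing toUpperCase() with equalsIgnoreCase() can fail for any character whose upper-case form expands to more than one char, not just ß.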

Java's Collator class is designed for locale-sensitive text comparison. Since the concept of "upper case" varies quite a bit between locales, Collator uses a more fine-grained model called comparison strength. Four levels are provided (PRIMARY, SECONDARY, TERTIARY and IDENTICAL), and how they affect comparisons is locale-dependent.

Here's an example of using Collator with the German locale for case-insensitive comparison of the letter ß:

import java.text.Collator;
import java.util.Locale;

Collator germanCollator = Collator.getInstance(Locale.GERMAN);
// Strengths from least strict (PRIMARY = 0) to most strict (IDENTICAL = 3).
int[] strengths = new int[] {Collator.PRIMARY, Collator.SECONDARY,
                             Collator.TERTIARY, Collator.IDENTICAL};

String a = "ß";
String b = "ß".toUpperCase(); // "SS"

for (int strength : strengths) {
    germanCollator.setStrength(strength);
    // compare() returns 0 when the collator treats the strings as equal
    // at the current strength.
    if (germanCollator.compare(a, b) == 0) {
        System.out.println(String.format(
                "%s and %s are equivalent when comparing differences with "
                + "strength %s using the GERMAN locale.",
                a, b, String.valueOf(strength)));
    }
}

The code prints out

ß and SS are equivalent when comparing differences with strength 0 using the GERMAN locale.
ß and SS are equivalent when comparing differences with strength 1 using the GERMAN locale.

which means that the German locale considers these two strings equal in PRIMARY (strength 0) and SECONDARY (strength 1) comparisons, but not in TERTIARY or IDENTICAL comparisons.
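Building on that result, a case-insensitive equality check can be wrapped in a small helper. This is only a sketch: the class name CollatorCaseInsensitive and the method equalsIgnoringCase are hypothetical, and the choice of SECONDARY strength assumes you want to ignore case but still distinguish accents, which may not hold for every locale.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorCaseInsensitive {

    // Hypothetical helper: compares two strings with a locale-specific
    // Collator at SECONDARY strength, where case differences are
    // typically ignored.
    static boolean equalsIgnoringCase(String x, String y, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY);
        return collator.equals(x, y);
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoringCase("hello", "HELLO", Locale.ENGLISH)); // true
        System.out.println(equalsIgnoringCase("hello", "world", Locale.ENGLISH)); // false

        // The ß / SS pair from the example above, checked with the GERMAN locale.
        System.out.println(equalsIgnoringCase("\u00DF", "SS", Locale.GERMAN));
    }
}
```

Unlike equalsIgnoreCase(), this approach does not require the two strings to have the same length, so expansions such as ß to SS do not rule out a match. If many comparisons use the same locale, it is probably worth creating the Collator once and reusing it rather than calling getInstance() each time.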
