Numeric comparing option for Java Collator
Asked Answered
M

2

7

Problem:

Let's say we have the following list of strings {"Test1.txt", "Test2.txt", "Test11.txt", "Test22.txt"}, sorting them using String::compareTo or Collator::compare would result in following order:

Test1.txt
Test2.txt
Test22.txt
Test3.txt

Which is inconvenient(arguably), while a more human-friendly outcome is:

Test1.txt
Test2.txt
Test3.txt
Test22.txt

To resolve this issues we can write our own compare method which is numeric sensitive. But what if we want numeric sensitive sort as well as the benefit of using existing implementation of Collator (or to avoid implementing one) for internationalization?

Is there a right way to handle this? or maybe a reliable library that addresses this problem?

Other Languages:

In Javascript world the Intl.Collator's constructors accepts a CollatorOption which allows you to set configs to achieve such functionality and more:

const usCollator = Intl.Collator("us", { numeric: true });
const list = ["Test1.txt", "Test2.txt", "Test3.txt", "Test22.txt"];
list.sort(usCollator.compare);
console.log(list);
Mollescent answered 28/10, 2022 at 19:42 Comment(2)
#105099Aires
an idea: tokenize the string in a (prefix, number) couple, then compare in the following way: s1 = (p1, n1) < s2 = (p2, n2) iff p1 < p2 (in the existing collator sense) or p1 == p2 and n1 < n2 (in the usual number sense); it is possible to generalize it having string = sequence of tokens easilyRundlet
D
2

You can use alphanumeric-comparator, which is available in Maven.

Decapod answered 3/11, 2022 at 12:16 Comment(2)
I couldn't find that one in maven repo but I've found this one github. Thank you btw.Mollescent
You're right, I linked to a Github repo which is not published to Maven central. I have edited my answer to point to the other library that you found.Decapod
H
0

There is such class as RuleBasedCollator, but it is limited to single chars so therefore, Collator isn't the way to go in general.

A pretty good solution to this problem would be to copy and adapt the C++ implementation that open-source browser engines have of the compare function. I tried to go that route (tried to see Webkit's implementation) but since I don't know C++, I failed to understand exactly what is happening.

Therefore, I decided to implement a couple of basic methods in Java that do this. The current solution works for UTF-8 encoding, but with some tweaks it can be adapted to the encoding of choice too. Here it is:

import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.Arrays;

public class AlphaCompare {
    public static void main(String[] args) throws ParseException {
        var strs = Arrays.asList(
                "Test1.txt",
                "Test2.txt",
                "Test11.txt",
                "Test22.txt",
                "123",
                "a123",
                "a123a",
                "a1231"
        );
        strs.sort(String::compareTo);
        System.out.println("Standard String::compareTo sorting:");
        strs.forEach(System.out::println);
        System.out.println("-----------------------------------");
        strs.sort(AlphaCompare::compareTo);
        System.out.println("Custom sorting:");
        strs.forEach(System.out::println);
    }

    public static int compareTo(String thisString, String anotherString) {
        byte v1[] = thisString.getBytes(StandardCharsets.UTF_8);
        byte v2[] = anotherString.getBytes(StandardCharsets.UTF_8);
        return compareChars(v1, v2, v1.length, v2.length);
    }

    private static int compareChars(byte[] value, byte[] other, int len1, int len2) {
        int lim = Math.min(len1, len2);
        for (int k = 0; k < lim; k++) {
            char c1 = (char) (value[k] & 0xFF);
            char c2 = (char) (other[k] & 0xFF);
            if (Character.isDigit(c1) && Character.isDigit(c2)) {
                int d1 = Character.getNumericValue(c1);
                int d2 = Character.getNumericValue(c2);
                StringBuilder d1Sb = new StringBuilder(d1);
                StringBuilder d2Sb = new StringBuilder(d2);
                appendIfDigit(value, lim, k, d1Sb);
                appendIfDigit(other, lim, k, d2Sb);
                final int digit1 = Integer.valueOf(d1Sb.toString());
                final int digit2 = Integer.valueOf(d2Sb.toString());
                if (digit1 != digit2) {
                    return digit1 - digit2;
                }
            }
            if (c1 != c2) {
                return c1 - c2;
            }
        }
        return len1 - len2;
    }

    private static void appendIfDigit(byte[] value, int lim, int k, StringBuilder d1Sb) {
        for (int l = k; l < lim; l++) {
            char cd1 = (char) (value[l] & 0xFF);
            if (Character.isDigit(cd1)) {
                d1Sb.append(Character.getNumericValue(cd1));
            } else {
                break;
            }
        }
    }
}

This outputs the following:

Standard String::compareTo sorting:
123
Test1.txt
Test11.txt
Test2.txt
Test22.txt
a123
a1231
a123a
-----------------------------------
Custom sorting:
123
Test1.txt
Test2.txt
Test11.txt
Test22.txt
a123
a123a
a1231

Process finished with exit code 0

Happy hacking! =)

Hafner answered 3/11, 2022 at 12:12 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.