Lexicographic Order in Java
I

4

13

How is lexicographic order defined in Java, especially with regard to special characters like !, . and so on?

An example ordering can be found here

But how does Java define its order? I ask because I'm sorting Strings in Java and in Oracle and get different results, and I can't find the specification for the lexicographic order.

Impanel answered 24/10, 2011 at 11:33 Comment(1)
If you need to change the ordering for natural languages or to match Oracle's ordering, see java.text.Collator. - Ethbun
R
27

From the docs for String.compareTo:

Compares two strings lexicographically. The comparison is based on the Unicode value of each character in the strings.

and

This is the definition of lexicographic ordering. If two strings are different, then either they have different characters at some index that is a valid index for both strings, or their lengths are different, or both. If they have different characters at one or more index positions, let k be the smallest such index; then the string whose character at position k has the smaller value, as determined by using the < operator, lexicographically precedes the other string. In this case, compareTo returns the difference of the two character values at position k in the two string [...]

So basically, it treats each string like a sequence of 16-bit unsigned integers. No cultural awareness, no understanding of composite characters etc. If you want a more complex kind of sort, you should be looking at Collator.
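To make that concrete for the characters from the question, here is a minimal sketch. The class and method names are made up for illustration, and Locale.ENGLISH is an arbitrary choice for the Collator; the punctuation order you get from a Collator depends on its collation rules, so only the compareTo result is annotated.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class CompareToDemo {
    // Sorts a copy with String.compareTo, i.e. plain UTF-16 code unit comparison.
    static List<String> sortByCodeUnit(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts a copy with a locale-aware Collator (Locale.ENGLISH is an assumption).
    static List<String> sortWithCollator(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Collator.getInstance(Locale.ENGLISH));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "M", ".", "!");
        // '!' is 33, '.' is 46, 'M' is 77, 'a' is 97
        System.out.println(sortByCodeUnit(words));   // [!, ., M, a]
        // Result depends on the collation rules of the chosen locale.
        System.out.println(sortWithCollator(words));
    }
}
```

Note that with compareTo every uppercase ASCII letter sorts before every lowercase one, which is usually the first surprise when comparing against a database sort.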

Radiolarian answered 24/10, 2011 at 11:38 Comment(0)
J
6

In Java it's based on the Unicode value of each character in the string:

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#compareTo(java.lang.String)

In Oracle, it depends on the character set your database uses. You'll want it to be UTF-8 to get behavior consistent with Java.

To check the character set:

SQL> SELECT parameter, value FROM nls_database_parameters 
     WHERE parameter = 'NLS_CHARACTERSET';

PARAMETER             VALUE 
------------------    ---------------------
NLS_CHARACTERSET      UTF8

If it's not UTF-8, then you can get different comparison behavior depending on which character set your Oracle database is using.
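As a rough Java-only illustration of how ordering can depend on the encoding, the sketch below compares strings by their UTF-8 bytes (treated as unsigned) and by String.compareTo. It assumes a plain byte-wise comparison and says nothing about what Oracle actually does for a given configuration; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingOrderDemo {
    // Compares two strings by their UTF-8 encoding, byte by byte, as unsigned values.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int limit = Math.min(x.length, y.length);
        for (int i = 0; i < limit; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String fullwidthTilde = "\uFF5E"; // U+FF5E, a single UTF-16 code unit
        String emoji = "\uD83D\uDE00";    // U+1F600, stored as a surrogate pair

        // For ASCII such as "." and "M" both orderings agree.
        System.out.println(".".compareTo("M") < 0);          // true
        System.out.println(compareUtf8Bytes(".", "M") < 0);  // true

        // Outside the BMP they can disagree: 0xFF5E > 0xD83D as code units,
        // but the UTF-8 lead byte 0xEF sorts before 0xF0.
        System.out.println(fullwidthTilde.compareTo(emoji) > 0);         // true
        System.out.println(compareUtf8Bytes(fullwidthTilde, emoji) < 0); // true
    }
}
```

For plain ASCII punctuation and letters the two orderings agree, so a difference there points at something other than byte order.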

Jabot answered 24/10, 2011 at 11:39 Comment(1)
Although this answer helped me the most, I marked @jonskeet's answer as correct because of the phrasing of the question. It turns out that the database used the alutf8 encoding (the default) and not utf8. For testing purposes I set up a database using utf8 and everything was sorted as expected. alutf8 orders "." after letters (it was an "M" for me), while utf8 orders "." before "M". Very annoying. - Impanel
F
2

From the Javadocs:

The comparison is based on the Unicode value of each character in the strings.

In more detail:

This is the definition of lexicographic ordering. If two strings are different, then either they have different characters at some index that is a valid index for both strings, or their lengths are different, or both. If they have different characters at one or more index positions, let k be the smallest such index; then the string whose character at position k has the smaller value, as determined by using the < operator, lexicographically precedes the other string. In this case, compareTo returns the difference of the two character values at position k in the two string ...
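The quoted definition can be written out by hand, which may make it easier to see. This is only a sketch of the rule, not the JDK's actual implementation, and the names are made up. The final line covers the case where one string is a prefix of the other, for which the Javadoc says the difference of the lengths is returned.

```java
class LexicographicDemo {
    // Hand-written version of the rule described in the String.compareTo Javadoc.
    static int compareLexicographically(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        for (int k = 0; k < limit; k++) {
            char ca = a.charAt(k);
            char cb = b.charAt(k);
            if (ca != cb) {
                // k is the smallest index at which the strings differ.
                return ca - cb;
            }
        }
        // No differing index: the shorter string comes first.
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        System.out.println(compareLexicographically(".", "M"));     // -31 (46 - 77)
        System.out.println(compareLexicographically("abc", "abd")); // -1
        System.out.println(compareLexicographically("ab", "abc"));  // -1
        System.out.println(compareLexicographically("a", "M"));     // 20 (97 - 77)
    }
}
```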

Feature answered 24/10, 2011 at 11:38 Comment(0)
A
0

Hope this helps!!

Employees are sorted in descending order of score, and if two employees have the same score, the employee name is used to order them lexicographically.

Employee class implementation: (Used Comparable interface for this case.)

@Override
public int compareTo(Object obj) {
    Employee emp = (Employee) obj;

    // Higher score sorts first (descending order).
    if (emp.getScore() > this.score) return 1;
    if (emp.getScore() < this.score) return -1;
    // Equal scores: order by name, ascending, ignoring case.
    return this.empName.compareToIgnoreCase(emp.getEmpName());
}
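The same ordering could also be expressed with a Comparator chain instead of Comparable. The sketch below is self-contained, so the Employee class in it is a hypothetical stand-in: the int score and the constructor are assumptions, only the getter names come from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class EmployeeSortDemo {
    // Hypothetical minimal Employee; the int score type is an assumption.
    static class Employee {
        private final String empName;
        private final int score;

        Employee(String empName, int score) {
            this.empName = empName;
            this.score = score;
        }

        String getEmpName() { return empName; }
        int getScore() { return score; }
    }

    // Score descending, then name ascending ignoring case.
    static final Comparator<Employee> BY_SCORE_DESC_THEN_NAME =
            Comparator.comparingInt(Employee::getScore).reversed()
                    .thenComparing(Employee::getEmpName, String.CASE_INSENSITIVE_ORDER);

    // Returns the employee names in sorted order without modifying the input list.
    static List<String> sortedNames(List<Employee> employees) {
        List<Employee> copy = new ArrayList<>(employees);
        copy.sort(BY_SCORE_DESC_THEN_NAME);
        List<String> names = new ArrayList<>();
        for (Employee e : copy) {
            names.add(e.getEmpName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<Employee> staff = Arrays.asList(
                new Employee("bob", 90),
                new Employee("Alice", 90),
                new Employee("carol", 95));
        System.out.println(sortedNames(staff)); // [carol, Alice, bob]
    }
}
```

String.CASE_INSENSITIVE_ORDER is still a per-character comparison with no locale awareness, so for names in natural languages a Collator may be the better tie-breaker.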
Anele answered 21/8, 2016 at 15:25 Comment(0)
