From compilation to runtime, how does Java String encoding really work

I recently realized that I don't fully understand Java's string encoding process.

Consider the following code:

public class Main
{
    public static void main(String[] args)
    {
        System.out.println(java.nio.charset.Charset.defaultCharset().name());
        System.out.println("ack char: ^"); /* where ^ = 0x06, the ack char */
    }
}

Since the control characters are interpreted differently between windows-1252 and ISO-8859-1, I chose the ack char for testing.

I now compile it with different file encodings: UTF-8, windows-1252, and ISO-8859-1. They all compile to exactly the same thing, byte for byte, as verified by md5sum.

I then run the program:

$ java Main | hexdump -C
00000000  55 54 46 2d 38 0a 61 63  6b 20 63 68 61 72 3a 20  |UTF-8.ack char: |
00000010  06 0a                                             |..|
00000012

$ java -Dfile.encoding=iso-8859-1 Main | hexdump -C
00000000  49 53 4f 2d 38 38 35 39  2d 31 0a 61 63 6b 20 63  |ISO-8859-1.ack c|
00000010  68 61 72 3a 20 06 0a                              |har: ..|
00000017

$ java -Dfile.encoding=windows-1252 Main | hexdump -C
00000000  77 69 6e 64 6f 77 73 2d  31 32 35 32 0a 61 63 6b  |windows-1252.ack|
00000010  20 63 68 61 72 3a 20 06  0a                       | char: ..|
00000019

It correctly outputs the 0x06 byte no matter which encoding is used.

So it always emits the same 0x06, which a windows-1252 code page would render as a printable [ACK] glyph.

That leads me to a few questions:

  1. Is the codepage / charset of the Java file being compiled expected to be identical to the default charset of the system under which it's being compiled? Are the two always synonymous?
  2. The compiled representation doesn't seem dependent on the compile-time charset, is this indeed the case?
  3. Does this imply that strings within Java files may be interpreted differently at runtime if they don't use standard characters for the current charset/locale?
  4. What else should I really know about string and character encoding in Java?
Echelon answered 29/1, 2010 at 20:6 Comment(3)
It's not clear what you mean by "compile it with different file encodings". Do you mean that you save the file in different encodings, then compile each of those files using the -encoding switch to javac? If so, how do you know what random garbage is winding up in the source files after saving them in those encodings? You can't put a literal control character into your source and expect it to survive serialization to encoded characters.Synaeresis
A file is nothing more than a stream of bytes. Those bytes are interpreted differently depending on the character encoding they are assumed to be in. Thus, I'm refering to strings which contain chars that may be interpreted differently, either at runtime or at compile-time, by assuming the file was encoded in different character sets.Echelon
To be explicit about the compilation step, I used sun's encoding property to set the charset at compilation time: javac -encoding windows-1252 Main.java, with the encoding set appropriately.Echelon
  1. Source files can be in any encoding
  2. You need to tell the compiler the encoding of source files (e.g. javac -encoding...); otherwise, platform encoding is assumed
  3. In class file binaries, string literals are stored as (modified) UTF-8, but unless you work with bytecode, this doesn't matter (see JVM spec)
  4. Strings in Java are UTF-16, always (see Java language spec)
  5. The System.out PrintStream will transform your strings from UTF-16 to bytes in the system encoding prior to writing them to stdout
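For point 5, here is a minimal sketch (the class and variable names are my own, for illustration) of how to bypass the platform default by wrapping stdout in a PrintStream with an explicit charset:

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class ExplicitOut
{
    public static void main(String[] args) throws Exception
    {
        // Default behaviour: System.out encodes the UTF-16 string
        // using the platform default charset before writing bytes.
        System.out.println("ack char: \u0006");

        // Explicit behaviour: the same string is always encoded as UTF-8,
        // regardless of the locale or of -Dfile.encoding.
        PrintStream utf8Out = new PrintStream(
                new FileOutputStream(FileDescriptor.out), true, "UTF-8");
        utf8Out.println("ack char: \u0006");
    }
}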

Sateen answered 29/1, 2010 at 20:21 Comment(0)

A summary of "what to know" about string encodings in Java:

  • A String instance, in memory, is a sequence of 16-bit "code units", which Java handles as char values. Conceptually, those code units encode a sequence of "code points", where a code point is "the number attributed to a given character as per the Unicode standard". Code points range from 0 to a bit more than one million (1,114,111), although only around 100,000 have been assigned so far. Code points from 0 to 65535 are encoded as a single code unit, while higher code points use two code units. This encoding is called UTF-16 (not to be confused with the older UCS-2, a fixed-width 16-bit encoding which cannot represent the higher code points). There are a few subtleties: some code points are designated as noncharacters (e.g. 65535), and a range of 2048 code points within the first 65536, the surrogates, is reserved precisely for the two-unit encoding of the higher code points. The first sketch after this list shows the difference between code units and code points in practice.
  • Code pages and the like do not impact how Java stores the strings in RAM. That's why "Unicode" starts with "Uni". As long as you do not perform I/O with your strings, you are in the world of Unicode where everybody uses the same mapping of characters to code points.
  • Charsets come into action when encoding strings into bytes, or decoding strings from bytes. Unless explicitly specified, Java will use a default charset which depends on the user "locale", a fuzzy aggregate notion of what makes a computer in Japan speak Japanese. When you print out a string with System.out.println(), the JVM will convert the string into something suitable for wherever those characters go, which often means converting them to bytes using a charset which depends on the current locale (or what the JVM guessed of the current locale).
  • The Java compiler is itself a Java application. It needs to interpret the contents of source files, which are, at the system level, just a bunch of bytes. For that it selects a default charset depending on the current locale, just like any other Java program would, since it is itself written in Java. The Java compiler (javac) accepts a command-line flag (-encoding) which can be used to override that default choice.
  • The Java compiler produces class files which are locale-independent. String literals end up in those class files in a (slightly modified) UTF-8 encoding, regardless of the charset the compiler used to interpret the source files. The locale on the system on which the compiler runs affects how the source code is read, but once the compiler has understood that your string contains code point number 6, then that code point, and no other, is what makes its way into the class file. Note that code points 0 to 127 are encoded identically in UTF-8, windows-1252 and ISO-8859-1, which is why your three compilations produced byte-for-byte identical class files.
  • Even though String instances do not depend on any kind of encoding as long as they remain in RAM, some of the operations you may want to perform on strings are locale-dependent. This is not a question of encoding; rather, a locale also defines a "language", and it so happens that the notions of uppercase and lowercase depend on the language being used. The usual suspect is calling "unicode".toUpperCase(): this yields "UNICODE", except if the current locale is Turkish, in which case you get "UNİCODE" (the "I" has a dot); see the second sketch after this list. The underlying assumption is that if the current locale is Turkish, then the data the application is handling is probably Turkish text; personally, I find this assumption questionable at best. But so it is.
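To make the first bullet concrete, here is a small sketch (the class name is invented for illustration) using a supplementary character, which occupies two char code units but counts as a single code point:

public class CodeUnits
{
    public static void main(String[] args)
    {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above 65535, so it is stored
        // as a surrogate pair: two code units, one code point.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                            // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));    // 1 code point
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e
    }
}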

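And a sketch of the Turkish uppercase surprise from the last bullet (again with an invented class name); passing an explicit Locale makes the result predictable:

import java.util.Locale;

public class TurkishI
{
    public static void main(String[] args)
    {
        String s = "unicode";
        // Locale-independent result: UNICODE
        System.out.println(s.toUpperCase(Locale.ENGLISH));
        // Under a Turkish locale the dotted/dotless 'i' rules apply: UNİCODE
        System.out.println(s.toUpperCase(new Locale("tr", "TR")));
    }
}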
In practical terms, you should specify encodings explicitly in your code, at least most of the time. Do not call String.getBytes(), call String.getBytes("UTF-8"). Use of the default, locale-dependent encoding is fine when it is applied to some data exchanged with the user, such as a configuration file or a message to display immediately; but elsewhere, avoid locale-dependent methods whenever possible.
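As an illustration of that advice, a minimal sketch (class name invented) comparing the explicit and the default-charset forms of getBytes():

import java.util.Arrays;

public class ExplicitBytes
{
    public static void main(String[] args) throws Exception
    {
        String s = "ack char: \u0006";
        byte[] explicit = s.getBytes("UTF-8"); // same bytes on every machine
        byte[] implicit = s.getBytes();        // depends on the default charset
        System.out.println(Arrays.toString(explicit));
        System.out.println(Arrays.toString(implicit));
    }
}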

Among the other locale-dependent parts of Java are calendars. There is the whole time zone business, which should relate to the geographical position of the computer (and which is not part of the "locale" stricto sensu). Also, countless Java applications mysteriously fail when run in Bangkok, because in a Thai locale Java defaults to the Buddhist calendar, according to which the current year is 2553. The sketch below shows that year offset.
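A small sketch of that calendar point (class name invented; the behaviour described is that of Sun/Oracle JDKs, whose locale data selects a Buddhist calendar for th_TH):

import java.util.Calendar;
import java.util.Locale;

public class ThaiYear
{
    public static void main(String[] args)
    {
        // On Sun/Oracle JDKs, the th_TH locale selects a Buddhist calendar
        // whose YEAR field runs 543 years ahead of the Gregorian year.
        Calendar gregorian = Calendar.getInstance(Locale.ENGLISH);
        Calendar thai = Calendar.getInstance(new Locale("th", "TH"));
        System.out.println(gregorian.get(Calendar.YEAR)); // e.g. 2010
        System.out.println(thai.get(Calendar.YEAR));      // e.g. 2553
    }
}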

As a rule of thumb, assume that the world is vast (it is!) and keep things generic: do not do anything that depends on a charset until the very last moment, when I/O must actually be performed.

Shepperd answered 29/1, 2010 at 21:35 Comment(0)

If you compile with different encodings, those encodings only affect how your source files are read. If your sources do not contain any special characters, there will be no difference in the resulting byte code.

At runtime, the default charset of the operating system is used. This is independent of the charset you used for compiling.

Hoffert answered 29/1, 2010 at 20:10 Comment(0)

Erm, based on this and this, the ACK control character is exactly the same (0x06) in both encodings. The difference the link you pointed out is talking about is that DOS/Windows actually has glyphs for most of the control characters in windows-1252 (like the heart/club/spade/diamond characters and smileys), while ISO-8859 does not.

Schaal answered 29/1, 2010 at 20:15 Comment(1)
You are correct, the ack char is 0x06 in both of those encodings. Perhaps I failed, but I was attempting to come up with a scenario in which it would be interpreted differently based on the current charset. @McDowell's blog post does a much better job at demonstrating what I was attempting to do.Echelon