You can confirm the following by looking at the source code of the relevant version of the java.lang.String class in OpenJDK. (For some really old versions of Java, String was partly implemented in native code. That source code is not publicly available.)
Prior to Java 9, the standard in-memory representation for a Java String was UTF-16 code units held in a char[].
With Java 6 update 21 and later, there was a non-standard option (-XX:+UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7.
For Java 9 and later, the implementation of String has been changed to use a compact representation by default. The java command documentation now says this:
-XX:-CompactStrings
Disables the Compact Strings feature. By default, this option is enabled. When this option is enabled, Java Strings containing only single-byte characters are internally represented and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding. This reduces, by 50%, the amount of space required for Strings containing only single-byte characters. For Java Strings containing at least one multibyte character: these are represented and stored as 2 bytes per character using UTF-16 encoding. Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings.
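The condition described above can be illustrated with a small sketch: a string qualifies for the compact single-byte form only if every one of its UTF-16 code units fits in the Latin-1 range. (The method name isLatin1Representable is made up for illustration; the real check lives inside the JDK's String implementation and is not part of the public API.)

```java
public class CompactStringCheck {
    // Hypothetical helper: true if every UTF-16 code unit fits in one
    // Latin-1 byte (<= 0xFF), i.e. the string is eligible for the
    // compact single-byte representation described above.
    static boolean isLatin1Representable(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Representable("hello"));  // true
        System.out.println(isLatin1Representable("héllo"));  // true: U+00E9 <= 0xFF
        System.out.println(isLatin1Representable("日本"));    // false: needs UTF-16
    }
}
```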
Note that neither classical, "compressed", nor "compact" strings ever used UTF-8 encoding as the String representation. Modified UTF-8 is used in other contexts; e.g. in class files, and in the object serialization format.
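You can observe modified UTF-8 directly via DataOutputStream.writeUTF, which uses it. One visible difference from standard UTF-8 is that U+0000 is encoded as the two-byte sequence 0xC0 0x80 rather than a single zero byte:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF("\u0000");  // a single NUL character
        }
        // writeUTF emits a 2-byte length prefix, then the encoded data.
        // Modified UTF-8 encodes U+0000 as 0xC0 0x80 (two bytes), so the
        // length prefix is 2 and the payload is C0 80.
        for (byte b : bytes.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();  // prints: 00 02 C0 80
    }
}
```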
To answer your specific questions:
Modified UTF-8? Or UTF-16? Which one is correct?
Either UTF-16 or an adaptive representation that depends on the actual data; see above.
And how many bytes does Java use for a char in memory?
A single char uses 2 bytes. There might be some "wastage" due to possible padding, depending on the context.
A char[] is 2 bytes per character plus the array header (typically 16 bytes on a 64-bit JVM with compressed oops: a 12-byte object header plus a 4-byte length field), with the total padded to (typically) a multiple of 8 bytes.
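Putting those numbers together, here is a rough back-of-envelope estimate of the heap footprint of a char[]. The 16-byte header and 8-byte alignment are assumptions for a typical 64-bit HotSpot JVM with compressed oops; both vary with JVM version and flags.

```java
public class CharArrayFootprint {
    // Assumed values for a typical 64-bit HotSpot JVM with compressed
    // oops; both vary with JVM version, heap size, and VM flags.
    static final int ARRAY_HEADER_BYTES = 16;  // mark word + class pointer + length
    static final int ALIGNMENT = 8;

    // Rough estimate of the heap footprint of a char[] of the given length.
    static long estimatedBytes(int length) {
        long raw = ARRAY_HEADER_BYTES + 2L * length;           // 2 bytes per char
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;  // pad up to 8
    }

    public static void main(String[] args) {
        System.out.println(estimatedBytes(0));   // 16
        System.out.println(estimatedBytes(10));  // 16 + 20 = 36, padded to 40
        System.out.println(estimatedBytes(100)); // 16 + 200 = 216 (already aligned)
    }
}
```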
Please let me know which one is correct and how many bytes it uses.
If we are talking about a String now, it is not possible to give a general answer. It will depend on the Java version and hardware platform, as well as the String length and (in some cases) what the characters are. Indeed, for some versions of Java it even depends on how you created the String.
Having said all of the above, the API model for String is that it is both a sequence of UTF-16 code-units and a sequence of Unicode code-points. As a Java programmer, you should be able to ignore everything that happens "under the hood". The internal String representation is (should be!) irrelevant.
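To illustrate the two views of the same string, here is a short sketch using a character outside the Basic Multilingual Plane (U+1F600, which is written as a surrogate pair in Java source):

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // "a" + U+1F600 (a surrogate pair in UTF-16) + "b"
        String s = "a\uD83D\uDE00b";

        // View 1: a sequence of UTF-16 code units (chars).
        System.out.println(s.length());                       // 4 code units
        System.out.println(Character.BYTES);                  // 2 bytes per char

        // View 2: a sequence of Unicode code points.
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points
        s.codePoints()
         .forEach(cp -> System.out.printf("U+%04X%n", cp));   // U+0061, U+1F600, U+0062
    }
}
```

Which view you want depends on the task: length() and charAt() work in code units, while codePointCount() and codePoints() work in code points; neither depends on the internal representation.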