How to Find the Default Charset/Encoding in Java?
Asked Answered
I

6

99

The obvious answer is to use Charset.defaultCharset() but we recently found out that this might not be the right answer. I was told that the result is different from real default charset used by java.io classes in several occasions. Looks like Java keeps 2 sets of default charset. Does anyone have any insights on this issue?

We were able to reproduce one fail case. It's kind of user error but it may still expose the root cause of all other problems. Here is the code,

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        String enc = writer.getEncoding();
        return enc;
    }
}

Our server requires default charset in Latin-1 to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol. So all our servers run with this JVM parameter,

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5,

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tries to change the encoding runtime by setting the file.encoding in the code. We all know that doesn't work. However, this apparently throws off defaultCharset() but it doesn't affect the real default charset used by OutputStreamWriter.

Is this a bug or feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, which is not the default encoding used by I/O classes. Looks like Java 6 corrects this issue.

Interesting answered 17/11, 2009 at 13:55 Comment(5)
That's odd, since the defaultCharset uses a static variable which is set only once (accoring to the docs - at VM startup). What VM Vendor are you using?Nadiya
I was able to reproduce this on Java 5, both on Sun/Linux and Apple/OS X.Interesting
That explains why defaultCharset() not caching the result. I still need to find out what's the real default charset used by IO classes. There must be another default charset cached somewhere else.Interesting
@ZZ Coder, I'm still researching on that. The only think I know is that Charset.defaulyCharset() isn't called from sun.nio.cs.StreamEncoder in JVM 1.5. In JVM 1.6 the Charset.defaulyCharset() method is called giving the expected results. JVM 1.5 implementation of StreamEncoder is caching the previous encoding, somehow.Nica
FYI, there is a formal proposal to “Specify UTF-8 as the default charset for the Java SE APIs, so that APIs which depend on the default charset behave consistently across all JDK implementations and independently of the user’s operating system, locale, and configuration.” See JEP 400: UTF-8 by Default, updated 2021-03-30.Benghazi
N
67

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset isn't being set. I don't know if this is a bug or not but 1.6 changes this implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

When you set the file encoding to file.encoding=Latin-1 the next time you call Charset.defaultCharset(), what happens is, because the cached default charset isn't set, it will try to find the appropriate charset for the name Latin-1. This name isn't found, because it's incorrect, and returns the default UTF-8.

As for why the IO classes such as OutputStreamWriter return an unexpected result,
the implementation of sun.nio.cs.StreamEncoder (witch is used by these IO classes) is different as well for JVM 1.5 and JVM 1.6. The JVM 1.6 implementation is based in the Charset.defaultCharset() method to get the default encoding, if one is not provided to IO classes. The JVM 1.5 implementation uses a different method Converters.getDefaultEncodingName(); to get the default charset. This method uses its own cache of the default charset that is set upon JVM initialization:

JVM 1.6:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
        Object lock,
        String charsetName)
        throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Charset.defaultCharset().name();
    try {
        if (Charset.isSupported(csn))
            return new StreamEncoder(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
    throw new UnsupportedEncodingException (csn);
}

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
        Object lock,
        String charsetName)
        throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

Nica answered 17/11, 2009 at 14:25 Comment(2)
To reproduce this error, you must be on Java 5 and your JRE default encoding must be UTF-8.Interesting
This is writing to the implementation, not the abstraction. If you rely on undocumented stuff, don't be surprised if your code breaks when you upgrade to a newer version of the platform.Positively
P
25

Is this a bug or feature?

Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.

Bug ID: 4153515 on problems setting this property:

This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.

I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.

Positively answered 17/11, 2009 at 14:30 Comment(0)
F
6

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:

  • Charset.defaultCharset() is not caching the determined character set in Java 5.
  • Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property, no character set with the name "Latin-1" is found, so Charset.defaultCharset() defaults to "UTF-8".
  • The OutputStreamWriter is however caching the default character set and is probably used already during VM initialization, so that its default character set diverts from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise on how the default character set is determined, only mentioning that it is usually done on VM startup, based on factors like the OS default character set or default locale.

Falda answered 17/11, 2009 at 15:21 Comment(0)
A
4

First, Latin-1 is the same as ISO-8859-1, so, the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command line parameter. You also set it programmatically to "Latin-1", but, that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html

When you do that, looks like Charset resets to UTF-8, from looking at the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.

But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.

Affairs answered 17/11, 2009 at 14:30 Comment(0)
L
3

I have set the vm argument in WAS server as -Dfile.encoding=UTF-8 to change the servers' default character set.

Lucubrate answered 17/12, 2013 at 17:58 Comment(0)
P
2

check

System.getProperty("sun.jnu.encoding")

it seems to be the same encoding as the one used in your system's command line.

Penneypenni answered 14/12, 2012 at 3:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.