Setting the default Java character encoding
S

18

412

How do I properly set the default character encoding used by the JVM (1.5.x) programmatically?

I have read that -Dfile.encoding=whatever used to be the way to go for older JVMs. I don't have that luxury for reasons I won't get into.

I have tried:

System.setProperty("file.encoding", "UTF-8");

And the property gets set, but it doesn't seem to cause the final getBytes call below to use UTF8:

System.setProperty("file.encoding", "UTF-8");

byte inbytes[] = new byte[1024];

FileInputStream fis = new FileInputStream("response.txt");
fis.read(inbytes);
FileOutputStream fos = new FileOutputStream("response-2.txt");
String in = new String(inbytes, "UTF8");
fos.write(in.getBytes());
Symphonia answered 12/12, 2008 at 5:31 Comment(6)
Excellent comments guys - and things I was already thinking myself. Unfortunately there is an underlying String.getBytes() call that I have no control over. The only way I currently see to get around it is to set the default encoding programmatically. Any other suggestions?Symphonia
Maybe an irrelevant question, but is there a difference when UTF-8 is set with "UTF8", "UTF-8" or "utf8"? Recently I found that the IBM WAS 6.1 EJB and web containers treat the strings used to define the encoding differently (with respect to case sensitivity).Rex
Just a detail but: prefer UTF-8 to UTF8 (only the former is standard). This still applies in 2012...Motorbus
Setting or reading the file.encoding property is not supported.Escaut
@erickson I am still not clear about the question. Isn't "file.encoding" only relevant when character-based I/O streams are used (all subclasses of class Reader and class Writer)? Class FileInputStream is a byte-based I/O stream, so why should one care about the character set in a byte-based I/O stream?Caparison
McDowell's comment should get more attention. The bug he linked to in the Oracle Java Bug Database (a working link here) got rejected with the evaluation saying: The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.Pitchblack
J
365

Unfortunately, the file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding used by String.getBytes() and the default constructors of InputStreamReader and OutputStreamWriter has been permanently cached.
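
The caching described above can be observed with a small experiment. This is only a sketch, assuming a Java 6 or later runtime, where Charset.defaultCharset() keeps the value it determined at startup; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class DefaultCharsetIsCached {

    /** Returns true if Charset.defaultCharset() ignores a runtime change of file.encoding. */
    static boolean defaultSurvivesPropertyChange() {
        Charset before = Charset.defaultCharset();
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from whatever the JVM started with.
        String target = before.name().equals("UTF-16") ? "UTF-8" : "UTF-16";
        System.setProperty("file.encoding", target);
        try {
            // The property has changed, but the cached default has not.
            return Charset.defaultCharset().equals(before);
        } finally {
            // Restore the property so the experiment leaves no trace.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("unchanged after setProperty: " + defaultSurvivesPropertyChange());
    }
}
```

If the second line reports that the default is unchanged, the property was set but the cached encoding was not affected, which matches the symptom in the question.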

As another user points out, in a special case like this, the environment variable JAVA_TOOL_OPTIONS can be used to specify this property, but it's normally done like this:

java -Dfile.encoding=UTF-8 … com.x.Main

On Java 5, Charset.defaultCharset() will reflect changes to the file.encoding property (later releases cache the default there too), but most of the code in the core Java libraries that needs to determine the default character encoding does not use this mechanism.

When you are encoding or decoding, you can query the file.encoding property or Charset.defaultCharset() to find the current default encoding, and use the appropriate method or constructor overload to specify it.
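
As a rough illustration of that last point, the sketch below passes the charset explicitly to both the encoding and the decoding overloads. It assumes Java 7 or later for StandardCharsets (on Java 5 the String.getBytes(String) and new String(byte[], String) overloads taking a charset name would be used instead); the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetRoundTrip {

    /** Encodes with the given charset instead of the platform default. */
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    /** Decodes with the given charset instead of the platform default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Querying the default is fine for diagnostics; it is just not relied upon below.
        System.out.println("platform default: " + Charset.defaultCharset());

        String text = "h\u00e9llo"; // "héllo", written with an escape to avoid source-encoding issues
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte count: " + utf8.length);
        System.out.println("round trip ok: " + decode(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

Because the charset is named at every call site, the result does not depend on how the JVM was started.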

Jagannath answered 12/12, 2008 at 5:56 Comment(4)
For completeness I would like to add that with a bit of trickery you can get to the actually used default encoding (as is cached), thanks to Gary Cronin: byte [] byteArray = {'a'}; InputStream inputStream = new ByteArrayInputStream(byteArray); InputStreamReader reader = new InputStreamReader(inputStream); String defaultEncoding = reader.getEncoding(); lists.xcf.berkeley.edu/lists/advanced-java/1999-October/…Leanto
JDK-4163515 has some more info on setting the file.encoding sysprop after JVM startup.Iyar
I was scratching my head because that command was not working properly on Windows, Linux and Mac... then I put " around the value like this: java -D"file.encoding=UTF-8" -jarBuffybuford
check my answer in case of Java Spring Boot: https://mcmap.net/q/86231/-setting-the-default-java-character-encodingFylfot
T
193

From the JVM™ Tool Interface documentation…

Since the command-line cannot always be accessed or modified, for example in embedded VMs or simply VMs launched deep within scripts, a JAVA_TOOL_OPTIONS variable is provided so that agents may be launched in these cases.

By setting the (Windows) environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, the (Java) System property will be set automatically every time a JVM is started. You will know that the parameter has been picked up because the following message will be posted to System.err:

Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
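
To double-check from inside the application that the option arrived, something like the following could be run. It is a minimal sketch with a hypothetical class name; it only reads the environment variable, the system property and the default charset, and prints whatever it finds.

```java
import java.nio.charset.Charset;

class ShowEncodingSettings {

    /** Collects the encoding-related settings visible to the running JVM. */
    static String describe() {
        return "JAVA_TOOL_OPTIONS=" + System.getenv("JAVA_TOOL_OPTIONS")
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // JAVA_TOOL_OPTIONS prints as null when the variable is not set.
        System.out.println(describe());
    }
}
```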

Terrilynterrine answered 8/3, 2009 at 4:31 Comment(3)
Do you know whether the "Picked up..." statement would be printed in the Tomcat logs?Assess
Hi Edward Grech, thank you for your solution. It resolved my problem in another forum post. #14814730Tallowy
@Tiny Java understands both. #6032377Roana
P
81

I have a hacky way that definitely works!!

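import java.lang.reflect.Field;
import java.nio.charset.Charset;

System.setProperty("file.encoding","UTF-8");
// Clear the cached default so that the next lookup re-reads file.encoding.
Field charset = Charset.class.getDeclaredField("defaultCharset");
charset.setAccessible(true);
charset.set(null,null);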

This tricks the JVM into thinking that the default charset has not been set yet, so it sets it again, this time to UTF-8, at runtime!

Potentate answered 20/2, 2013 at 19:9 Comment(13)
NoSuchFieldException for meLamphere
For the hack to work, you need to assume the security manager is off. If you don't have a way to set a JVM flag, you might (probably) have a security manager enabled system as well.Longlegged
Though I haven't understood what it is, it works fine for me! Thanks. I hope it doesn't create any new issues in my app. Cheers!Aelber
This worked for me, but the underlying issue was that the ssh connections used to spin up our jars had LC_* set wrong (in the profile).Salazar
JDK9 does not approve of this hack anymore. WARNING: An illegal reflective access operation has occurred • WARNING: Illegal reflective access by [..] • WARNING: Please consider reporting this to the maintainers of [..] • WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations • WARNING: All illegal access operations will be denied in a future releaseBookish
@Enerccio: That's not a good answer, that's a dirty hack, and a problem waiting to happen. That should only be used as an emergency measure.Pringle
@Pringle problem is that java should have a way to override this but alas they don't, so this is a good answer because it is the ONLY answerCasemate
@Enerccio: It's arguable whether Java "should" have a way to set this - one could also argue that developers "should" explicitly specify encoding whenever it's relevant. At any rate, this solution has the potential to cause serious trouble in the longer run, hence the "for emergency use only" caveat. Actually, even emergency use is questionable, because there is a supported way of doing it, setting JAVA_TOOL_OPTIONS as explained in another answer.Pringle
@Pringle none of the other solutions can change it at runtime... and if you have a library that uses the default encoding, you might not have its sources, and you must use that library, this is the only working solution...Casemate
@Enerccio: If this solution works, using JAVA_TOOL_OPTIONS should work too, and is actually a supported solution.Pringle
For me, just setting the system property helped fix my encoding problem where the IDE used UTF-8 and the JAR file the default system encoding which bugged my resource bundle strings.Ubangishari
I confirm this works on JRE 1.8 and Chinese Windows 10!Choiseul
@Bookish The illegal reflective access warning (or error, in modern JDKs) can be worked around via --add-opens=java.base/java.nio.charset=ALL-UNNAMED. Though I'd be hesitant to re-open internals in production, in my case it feels OK for test executions, so simply adding it to the <argLine> config of surefire or failsafe does the trick.Stoughton
R
40

I think a better approach than setting the platform's default character set, especially as you seem to have restrictions on affecting the application deployment, let alone the platform, is to call the much safer String.getBytes("charsetName"). That way your application is not dependent on things beyond its control.

I personally feel that String.getBytes() should be deprecated, as it has caused serious problems in a number of cases I have seen, where the developer did not account for the default charset possibly changing.

Rayleigh answered 12/12, 2008 at 5:39 Comment(1)
Yes. The default encoding is there solely to mirror the locale set in the underlying operating system; no one should change that. ALWAYS specify the encoding if you don't want to use the one resulting from the system locale.Pitchblack
S
20

I can't answer your original question, but I would like to offer you some advice: don't depend on the JVM's default encoding. It's always best to explicitly specify the desired encoding (e.g. "UTF-8") in your code. That way, you know it will work even across different systems and JVM configurations.

Swelter answered 12/12, 2008 at 5:36 Comment(5)
Except, of course, if you're writing a desktop app and processing some user-specified text that does not have any encoding metadata - then the platform default encoding is your best guess as to what the user might be using.Exsect
@MichaelBorgwardt "then the platform default encoding is your best guess": you seem to be advising that wanting to change the default is not such a good idea. Do you mean: use an explicit encoding wherever possible, and use the supplied default when nothing else is possible?Gusty
@Raedwald: yes, that's what I meant. The platform default encoding is (at least on an end user machine) what users in the locale the system is set to are typically using. That is information you should use if you have no better (i.e. document-specific) information.Exsect
@MichaelBorgwardt Nonsense. Use a library to auto-detect the input encoding, and save as Unicode with BOM. That is the only way to deal with and fight encoding hell.Blakley
I think you two are not on the same page. Michael talks about decoding, while Raedwald, you talk about processing after decoding.Marquardt
P
15

Try this:

new OutputStreamWriter(new FileOutputStream("Your_file_fullpath" ),Charset.forName("UTF8"))
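
Expanded into a complete program, the idea might look like the sketch below. It assumes Java 7 or later (try-with-resources, StandardCharsets, java.nio.file) and writes to a temporary file rather than a real path; the class and method names are illustrative only.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileRoundTrip {

    /** Writes text as UTF-8, regardless of the platform default encoding. */
    static void write(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    /** Reads the whole file back, decoding it as UTF-8. */
    static String read(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    /** Writes to a temporary file, reads it back, and reports whether the text survived. */
    static boolean roundTrips(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try {
                write(file, text);
                return read(file).equals(text);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("round trip ok: " + roundTrips("h\u00e9llo"));
    }
}
```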
Paunchy answered 20/1, 2012 at 18:9 Comment(0)
B
7

We were having the same issues. We methodically tried several suggestions from this article (and others) to no avail. We also tried adding the -Dfile.encoding=UTF8 and nothing seemed to be working.

For people who are having this issue, the following article, which finally helped us track it down, describes how the locale setting can break Unicode/UTF-8 in Java/Tomcat:

http://www.jvmhost.com/articles/locale-breaks-unicode-utf-8-java-tomcat

Setting the locale correctly in the ~/.bashrc file worked for us.

Breana answered 9/1, 2014 at 0:46 Comment(0)
I
7

I have tried a lot of things, but the sample code here works perfectly. Link

The crux of the code is:

String s = "एक गाव में एक किसान";
String out = new String(s.getBytes("UTF-8"), "ISO-8859-1");
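
One way to read the two lines above: the string is encoded as UTF-8, and each resulting byte is then exposed as one ISO-8859-1 character, so that code which later encodes the string as ISO-8859-1 emits the original UTF-8 bytes. The sketch below (hypothetical class and method names, Java 7 or later) shows that this reinterpretation is reversible; it is a workaround for a layer that insists on ISO-8859-1, not a general fix.

```java
import java.nio.charset.StandardCharsets;

class Latin1Reinterpretation {

    /** Exposes each UTF-8 byte of the input as one ISO-8859-1 character. */
    static String utf8BytesAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses utf8BytesAsLatin1: turns the characters back into bytes and decodes them as UTF-8. */
    static String latin1CharsAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u090f\u0915"; // the first word of the sample string, as escapes
        String reinterpreted = utf8BytesAsLatin1(original);
        // Each Devanagari letter takes three UTF-8 bytes, so two letters become six characters.
        System.out.println("chars after reinterpretation: " + reinterpreted.length());
        System.out.println("reversible: " + latin1CharsAsUtf8(reinterpreted).equals(original));
    }
}
```

The intermediate string is not readable text; it only makes sense if something downstream turns it back into bytes with ISO-8859-1.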
Interfile answered 3/7, 2014 at 9:33 Comment(0)
F
7

If you are using Spring Boot and want to pass the file.encoding argument to the JVM, you have to run it like this:

mvn spring-boot:run -Drun.jvmArguments="-Dfile.encoding=UTF-8"

This was needed for us since we were using JTwig templates and the operating system had ANSI_X3.4-1968, which we found out through System.out.println(System.getProperty("file.encoding"));

Hope this helps someone!

Fylfot answered 23/2, 2018 at 17:1 Comment(0)
A
5

My team encountered the same issue on Windows machines, then managed to resolve it in two ways:

a) Set environment variable (even in Windows system preferences)

JAVA_TOOL_OPTIONS
-Dfile.encoding=UTF8

b) Introduce the following snippet to your pom.xml:

 -Dfile.encoding=UTF-8 

WITHIN

 <jvmArguments>
 -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8001
 -Dfile.encoding=UTF-8
 </jvmArguments>
Anabal answered 3/7, 2019 at 12:58 Comment(0)
P
2

I'm using Amazon (AWS) Elastic Beanstalk and successfully changed it to UTF-8.

In Elastic Beanstalk, go to Configuration > Software, "Environment properties". Add (name) JAVA_TOOL_OPTIONS with (value) -Dfile.encoding=UTF8

After saving, the environment will restart with the UTF-8 encoding.

Peregrination answered 24/4, 2018 at 8:59 Comment(0)
R
1

I'm not clear on what you do and don't have control over at this point. If you can interpose a different OutputStream class on the destination file, you could use a subtype of OutputStream which converts Strings to bytes under a charset you define, say UTF-8 by default. If modified UTF-8 is sufficient for your needs, you can use DataOutputStream.writeUTF(String):

byte inbytes[] = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
fis.read(inbytes);
String in = new String(inbytes, "UTF8");
DataOutputStream out = new DataOutputStream(new FileOutputStream("response-2.txt"));
out.writeUTF(in); // no getBytes() here

If this approach is not feasible, it may help if you clarify here exactly what you can and can't control in terms of data flow and execution environment (though I know that's sometimes easier said than determined). Good luck.
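
Before relying on writeUTF, it may help to see how its output differs from plain UTF-8. The sketch below is only an illustration with made-up class and method names (Java 8 or later for UncheckedIOException): writeUTF prefixes the data with a two-byte length and encodes U+0000 as the two bytes C0 80, so the result is not a plain UTF-8 text file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {

    /** Returns the exact bytes that DataOutputStream.writeUTF produces for the text. */
    static byte[] writeUtfBytes(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Two length bytes are written in front of the three ASCII bytes.
        System.out.println("writeUTF(\"abc\") bytes: " + writeUtfBytes("abc").length);
        System.out.println("UTF-8 \"abc\" bytes: " + "abc".getBytes(StandardCharsets.UTF_8).length);
        // U+0000 is one byte in UTF-8 but two bytes (C0 80) in modified UTF-8.
        System.out.println("writeUTF(U+0000) bytes: " + writeUtfBytes("\u0000").length);
    }
}
```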

Rayleigh answered 16/12, 2008 at 3:59 Comment(1)
DataInputStream and DataOutputStream are special-purpose classes that should never be used with plain text files. The modified UTF-8 they employ is not compatible with real UTF-8. Besides, if the OP could use your solution, he could also use the right tool for this job: an OutputStreamWriter.Pulchi
A
1
mvn clean install -Dfile.encoding=UTF-8 -Dmaven.repo.local=/path-to-m2

This command worked with exec-maven-plugin to resolve the following error while configuring a Jenkins task:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Error occurred during initialization of VM
java.nio.charset.IllegalCharsetNameException: "UTF-8"
    at java.nio.charset.Charset.checkName(Charset.java:315)
    at java.nio.charset.Charset.lookup2(Charset.java:484)
    at java.nio.charset.Charset.lookup(Charset.java:464)
    at java.nio.charset.Charset.defaultCharset(Charset.java:609)
    at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:56)
    at java.io.OutputStreamWriter.<init>(OutputStreamWriter.java:111)
    at java.io.PrintStream.<init>(PrintStream.java:104)
    at java.io.PrintStream.<init>(PrintStream.java:151)
    at java.lang.System.newPrintStream(System.java:1148)
    at java.lang.System.initializeSystemClass(System.java:1192)
Adopt answered 6/3, 2018 at 8:28 Comment(0)
F
1

I solved this problem in my project. Hope it helps someone.

I use the LIBGDX Java framework and also had this issue in my Android Studio project. On Mac OS the encoding is correct, but on Windows 10 special characters and symbols, as well as Russian characters, show up as question marks (?????) and other incorrect symbols.

  1. In the Android Studio project settings, change File->Settings...->Editor->File Encodings to UTF-8 in all three fields (Global Encoding, Project Encoding and Default below).

  2. In any Java file set:

    System.setProperty("file.encoding","UTF-8");

  3. To test it, print a debug log:

    System.out.println("My project encoding is : "+ Charset.defaultCharset());

Fit answered 7/8, 2020 at 13:52 Comment(0)
B
1

Setting JVM arguments when starting the application helped me resolve this issue: java -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8

file.encoding=UTF-8 - This makes Unicode characters work in file contents.

sun.jnu.encoding=UTF-8 - This makes Unicode characters work in file names in the file system.

Broads answered 5/11, 2021 at 6:36 Comment(0)
B
0

We set these two system properties together, and it makes the system treat everything as UTF-8:

file.encoding=UTF8
client.encoding.override=UTF-8
Boarding answered 19/1, 2012 at 19:23 Comment(1)
The client.encoding.override property seems to be WebSphere specific.Motorbus
S
0

Following @Caspar's comment on the accepted answer, the preferred way to fix this, according to Sun, is:

"change the locale of the underlying platform before starting your Java program."

http://bugs.java.com/view_bug.do?bug_id=4163515

For docker see:

http://jaredmarkell.com/docker-and-locales/

Scanderbeg answered 5/10, 2017 at 15:40 Comment(0)
N
0

Recently I ran into a local company's Notes 6.5 system and found that the webmail showed unidentifiable characters on a Windows installation with a non-Chinese locale. After digging online for several weeks, I figured it out just a few minutes ago:

In Java properties, add the following string to Runtime Parameters

-Dfile.encoding=MS950 -Duser.language=zh -Duser.country=TW -Dsun.jnu.encoding=MS950

UTF-8 setting would not work in this case.

Norford answered 14/10, 2017 at 17:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.