Reliance on default encoding, what should I use and why?

D

5

28

FindBugs reports a bug:

Reliance on default encoding Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.

I used FileReader like this (just a piece of code):

public ArrayList<String> getValuesFromFile(File file){
    String line;
    StringTokenizer token;
    ArrayList<String> list = null;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(file));
        list = new ArrayList<String>();
        while ((line = br.readLine())!=null){
            token = new StringTokenizer(line);
            token.nextToken();
            list.add(token.nextToken());
    ...

To correct the bug I need to change

br = new BufferedReader(new FileReader(file));

to

br = new BufferedReader(new InputStreamReader(new FileInputStream(file), Charset.defaultCharset()));

When I use PrintWriter, the same warning is reported. So I have two questions. First, when can (or should) I use FileReader and PrintWriter, if relying on the default encoding is not good practice? Second, is this a proper use of Charset.defaultCharset()? I decided to use this method so that the charset of the user's OS is picked up automatically.

Dan answered 1/3, 2014 at 13:51 Comment(0)

A

26

If the file is under the control of your application, and if you want the file to be encoded in the platform's default encoding, then you can use the default platform encoding. Specifying it explicitly makes it clearer, for you and future maintainers, that this is your intention. This would be a reasonable default for a text editor, for example, which would then write files that any other editor on this platform could read.

If, on the other hand, you want to make sure that any possible character can be written in your file, you should use a universal encoding like UTF-8.

And if the file comes from an external application, or is supposed to be compatible with an external application, then you should use the encoding that this external application expects.

What you must realize is that if you write a file the way you're doing now on one machine, and read it the same way on another machine that doesn't have the same default encoding, you won't necessarily be able to read what you have written. Using a specific encoding such as UTF-8, for both writing and reading, makes sure the file is always interpreted the same way, whatever platform is used to write or read it.
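To make the mismatch concrete, here is a minimal sketch that encodes a string with one charset and decodes the bytes with another, which is roughly what happens when a file is written on one machine and read on a machine with a different default. The class name EncodingMismatchDemo, the method writeThenRead, and the sample text are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names and the sample text are illustrative only.
final class EncodingMismatchDemo {

    // Encodes the text with one charset and decodes the bytes with another,
    // simulating "written on machine A, read on machine B".
    static String writeThenRead(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the e, written as an
        // escape so the source file's own encoding does not matter.
        String original = "caf\u00e9";

        String same = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Print booleans rather than the text, so the console encoding is not involved.
        System.out.println("same charset round-trips:  " + original.equals(same));
        System.out.println("mixed charsets round-trip: " + original.equals(mixed));
    }
}
```

With the same charset on both sides the text survives. With UTF-8 on one side and ISO-8859-1 on the other, the accented character comes back as something else, while the plain ASCII letters are unaffected, which is why this kind of bug often goes unnoticed in tests that only use English text.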

Allie answered 1/3, 2014 at 13:57 Comment(4)

It may be worth suggesting that even when the default encoding is used, it's specified explicitly for clarity. – Jethro 1/3, 2014 at 13:59

You just did it :-) I added a sentence in the first paragraph. Thanks. – Allie 1/3, 2014 at 14:0

OK, but what should I do if my application is supposed to be compatible with an external application, but I don't know its encoding? Does the Charset.defaultCharset() method let me determine this encoding? – Dan 1/3, 2014 at 14:7

Read the documentation of the external app. Use its GUI and try to discover which encoding it uses. Or use it to write all kinds of characters (ASCII, occidental, Chinese, etc.), do the same thing yourself with various encodings, and compare the generated files to see which encoding is used. Good text editors have heuristics that try to guess the encoding used in a file. You could thus also try to open a file generated by the external app with such an editor and see what it guesses. – Allie 1/3, 2014 at 14:11
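One way to automate part of that comparison is to try decoding the external app's bytes strictly with each candidate charset and see which ones fail. This is only a sketch: the class CharsetProbe and the method decodesCleanly are hypothetical names, and a clean decode does not prove the guess is right (ISO-8859-1, for instance, accepts any byte sequence). It only helps rule candidates out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; it can rule charsets out but cannot confirm one.
final class CharsetProbe {

    // Returns true if the bytes decode without any malformed or unmappable
    // sequence in the given charset, false otherwise.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is invalid in this charset.
            return false;
        }
    }
}
```

For example, a file containing accented characters that fails decodesCleanly with StandardCharsets.UTF_8 was almost certainly not written as UTF-8, so a legacy single-byte encoding becomes the more likely candidate.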

W

27

Ideally, it should be:

try (InputStream in = new FileInputStream(file);
     Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(reader)) {

...or:

try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {

...assuming the file is encoded as UTF-8.

Pretty much every encoding that isn't a Unicode Transformation Format is obsolete for natural language data. There are languages you cannot support without Unicode.
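Applied to the method from the question, that could look roughly like the sketch below. It assumes the file really is UTF-8 and that lines with fewer than two tokens can be skipped (the original code would throw on those); the class name ValuesReader and the switch from File to Path are my own choices, not something from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical rewrite of the question's method with an explicit charset.
final class ValuesReader {

    // Returns the second whitespace-separated token of every line,
    // mirroring the loop in the question.
    static List<String> getValuesFromFile(Path path) throws IOException {
        List<String> list = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                StringTokenizer token = new StringTokenizer(line);
                if (token.countTokens() < 2) {
                    continue; // assumption: silently skip lines without a second token
                }
                token.nextToken();            // discard the first token
                list.add(token.nextToken());  // keep the second one
            }
        }
        return list;
    }
}
```

One difference worth knowing: as far as I know, Files.newBufferedReader reports malformed input with an exception (MalformedInputException), whereas InputStreamReader constructed with a Charset silently substitutes a replacement character. If an existing File is all you have, file.toPath() gives you the Path.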

Work answered 1/3, 2014 at 16:42 Comment(0)

F

3

You should use default encoding whenever you read a file that is outside your application and can be assumed to be in the user's local encoding, for example user written text files. You might want to use the default encoding when writing such files, depending on what the user is going to do with that file later.

You should not use default encoding for any other file, especially application relevant files.

If your application, for example, writes configuration files in text format, you should always specify the encoding. In general, UTF-8 is a good choice, as it is compatible with almost everything. Not doing so might cause surprising crashes for users in other countries.

This applies not only to character encoding, but also to date/time, numeric, and other language-specific formats. If you, for example, use the default encoding and default date/time strings on a US machine, then try to read that file on a German server, you might be surprised that one half is gibberish and the other half has months and days confused or is off by one hour because of daylight saving time.
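The same "be explicit" approach works for numbers and dates. The sketch below passes a Locale to String.format instead of relying on the machine's default, and writes dates in ISO-8601 form; the class and method names (LocaleNeutralFormatting, formatAmount, formatDate) are invented for this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper showing locale-explicit formatting for application files.
final class LocaleNeutralFormatting {

    // Formats with two decimals using the given locale. Locale.ROOT gives a
    // decimal point; a German locale would typically give a decimal comma.
    static String formatAmount(double amount, Locale locale) {
        return String.format(locale, "%.2f", amount);
    }

    // ISO-8601 date (yyyy-MM-dd): unambiguous about month/day order and
    // independent of the default locale.
    static String formatDate(LocalDate date) {
        return DateTimeFormatter.ISO_LOCAL_DATE.format(date);
    }

    public static void main(String[] args) {
        System.out.println(formatAmount(1234.5, Locale.ROOT));
        System.out.println(formatDate(LocalDate.of(2014, 3, 1)));
    }
}
```

For files the application reads back itself, Locale.ROOT plus ISO dates keeps the format stable across machines. For text shown to the user, the user's own locale is still the right choice.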

Falcate answered 1/3, 2014 at 14:12 Comment(0)

B

2

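When you are using a PrintWriter, wrap it around an OutputStreamWriter created with an explicit charset:

File file = new File(file_path);
// Pass the Charset object itself; the String overload (via .name()) declares a checked UnsupportedEncodingException.
Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_16);
PrintWriter pw = new PrintWriter(w);
pw.println(content_to_write);
pw.close();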
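A variant of the same idea, sketched with try-with-resources and UTF-8 rather than UTF-16 (that charset choice is my assumption, pick whatever your file format requires). The class name Utf8PrintWriterSketch and the method writeLines are hypothetical.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: PrintWriter over a writer with an explicit charset.
final class Utf8PrintWriterSketch {

    // Writes each line followed by the platform line separator, encoded as UTF-8.
    static void writeLines(Path path, Iterable<String> lines) throws IOException {
        try (PrintWriter pw = new PrintWriter(
                Files.newBufferedWriter(path, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                pw.println(line);
            }
            // PrintWriter swallows IOExceptions; checkError() flushes and
            // reports whether any write failed.
            if (pw.checkError()) {
                throw new IOException("Failed to write " + path);
            }
        }
    }
}
```

Note that println still uses the platform's line separator, so only the character encoding is pinned down here, not the line endings.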

Brokerage answered 3/4, 2018 at 9:55 Comment(0)

G

0

This will work on Java 11 and later, where FileReader gained constructors that take a Charset:

FileReader file = new FileReader(csvFile, Charset.forName("UTF-8"));

BufferedReader csvReader = new BufferedReader(file);

Gainey answered 2/10, 2021 at 2:0 Comment(0)
