Writing UTF-8 without BOM

This code,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes());

And this,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes(StandardCharsets.UTF_8));

produce the same result (in my opinion), which is UTF-8 without a BOM. However, Notepad++ does not show any encoding information. I expected it to show "Encode in UTF-8 without BOM", but nothing is selected in the "Encoding" menu.

Now, this code writes the file as UTF-8 with a BOM.

 OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
 byte[] bom = { (byte) 239, (byte) 187, (byte) 191 };
 out.write(bom);
 out.write("A".getBytes()); 

In this case Notepad++ displays the encoding as "Encode in UTF-8".

Question: What is wrong with the first two snippets, which are supposed to write the file as UTF-8 without a BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?

Is Notepad++ only guessing?

Weathertight answered 4/11, 2013 at 13:31 Comment(2)
The letter A might be UTF-8, or ISO-646, or ISO-8859-1, or ISO-8859-2, or .... There's no way for Notepad++ to guess that you are thinking UTF-8. Kappa
If you don't specify the encoding (first example), the JVM uses the platform default charset (typically an "ANSI" code page such as windows-1252 on Windows and usually UTF-8 on Linux; since Java 18 the default is UTF-8 on every platform). Sweetandsour

"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII, ISO-8859-*, or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.

Think of it this way:

  • "A".getBytes("UTF-8") returns a new byte[] { 65 }
  • "A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
  • You write the results of those calls into a file
  • How is the consumer of the file supposed to distinguish the two?

There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
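The point can be checked directly. This is a minimal sketch; the class name `SameBytesDemo` and the helper `sameBytes` are made up for illustration and are not part of the question's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares the bytes produced by different charsets.
class SameBytesDemo {

    // Returns true when both charsets encode the text to identical bytes.
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        // A single byte with the decimal value 65, and no BOM.
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8))); // [65]

        // The encodings are indistinguishable for pure ASCII content.
        System.out.println(sameBytes("A", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // true
        System.out.println(sameBytes("A", StandardCharsets.UTF_8, StandardCharsets.US_ASCII));   // true
    }
}
```

Since the bytes are identical, a reader of the file has nothing to base a decision on.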

Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess, there's no metadata that tells it which encoding to use).
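A minimal sketch of that experiment, assuming a temporary file instead of the path from the question; the class `NonAsciiDemo` and the method `writeUtf8NoBom` are hypothetical names. The "ä" is written as `\u00e4` so the example does not depend on the encoding of the source file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: writes non-ASCII text as UTF-8 without a BOM.
class NonAsciiDemo {

    static final String TEXT = "K\u00e4sekuchen"; // "Käsekuchen"

    // Writes the text as UTF-8; getBytes never adds a BOM for UTF-8.
    static Path writeUtf8NoBom(Path target, String text) throws IOException {
        return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("kaesekuchen", ".txt");
        writeUtf8NoBom(file, TEXT);

        byte[] onDisk = Files.readAllBytes(file);
        // In UTF-8 the "ä" takes two bytes, so the file has 11 bytes for 10 characters.
        System.out.println(onDisk.length);                                      // 11
        System.out.println(TEXT.getBytes(StandardCharsets.ISO_8859_1).length);  // 10
        // The two bytes of "ä" in UTF-8:
        System.out.printf("%02X %02X%n", onDisk[1] & 0xFF, onDisk[2] & 0xFF);   // C3 A4
        System.out.println("Open in Notepad++: " + file);
    }
}
```

Now the file differs depending on the encoding used, which gives an editor something to base its guess on.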

Actinotherapy answered 4/11, 2013 at 13:34 Comment(4)
Do you mean that Notepad++ is only guessing? Weathertight
@Mawia: yes, exactly. "Plain text" has no metadata that would tell it the encoding (except if there is a BOM, of course), so it uses a set of heuristics to guess which encoding is most likely. And that's not really the fault of Notepad++: there's not much you can do other than guess (you could ask the user every time, but that would get annoying quickly). Actinotherapy
OK, I think that makes sense, because when I write the file in UTF-16, Notepad++ shows "Encode in UCS-2 Big Endian". So Notepad++ is simply guessing, right? Weathertight
@Mawia: I already wrote in the answer that it guesses, and I also confirmed it in my comment above. Are you waiting for a third confirmation? ;-) Some encodings have "more obvious" tells than others: UTF-16, for example, can often be detected if every second byte is 0 (for English-language text), while UTF-8 can be detected by some common sequences (and by other byte sequences that can never occur in it). Other encodings can be "detected" by statistical analysis of the byte values. But all of that is really just guessing. Actinotherapy
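The kind of guessing described in these comments could look roughly like the sketch below. It is not how Notepad++ actually implements detection; the class `EncodingGuess`, its method names, and the returned labels are all invented for illustration. It checks for a BOM first, then treats pure ASCII as ambiguous, then tests whether the bytes form valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic; real editors use more elaborate rules.
class EncodingGuess {

    static String guess(byte[] data) {
        // 1. A BOM is the only explicit hint a plain-text file can carry.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8 with BOM";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16 big endian (BOM)";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16 little endian (BOM)";

        // 2. Only bytes below 128: every ASCII-compatible encoding fits.
        boolean allAscii = true;
        for (byte b : data) {
            if (b < 0) { allAscii = false; break; }
        }
        if (allAscii) return "ASCII-compatible (ambiguous)";

        // 3. Byte sequences that can never occur in UTF-8 rule it out.
        return isValidUtf8(data) ? "probably UTF-8 without BOM"
                                 : "probably a legacy 8-bit encoding";
    }

    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    static boolean isValidUtf8(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently replacing bad input.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess("A".getBytes(StandardCharsets.UTF_8)));
        System.out.println(guess("K\u00e4sekuchen".getBytes(StandardCharsets.UTF_8)));
        System.out.println(guess("K\u00e4sekuchen".getBytes(StandardCharsets.ISO_8859_1)));
        // Java's "UTF-16" charset writes a big-endian BOM when encoding.
        System.out.println(guess("A".getBytes(StandardCharsets.UTF_16)));
    }
}
```

Even the "valid UTF-8" branch is only a guess: a file in a legacy encoding can happen to contain byte sequences that are also valid UTF-8.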

I am not sure my answer is correct, but here is my understanding.

As explained above, if you simply write "A", Notepad++ has no way to know which encoding was used. If you want Notepad++ to show "Encode in UTF-8 without BOM", as in the figure below:

[image]

then you have to fool Notepad++, which you can do with the following piece of code:

[image]

If you want Notepad++ to show "Encode in UTF-8" instead, remove the substring part from the code so that osw.write("\uFEFF") writes the BOM character to the file. With the BOM in place, Notepad++ reports the file as "Encode in UTF-8"; when the BOM is removed programmatically, it reports "Encode in UTF-8 without BOM".
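The code in the screenshot is not preserved, so the following is only a sketch of the idea and not the original code. It assumes `osw` was an `OutputStreamWriter` created with UTF-8; the class `BomWriter` and its `withBom` flag are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: writes UTF-8 text with or without a leading BOM.
class BomWriter {

    static void write(Path target, String text, boolean withBom) throws IOException {
        try (OutputStream out = Files.newOutputStream(target);
             Writer osw = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            if (withBom) {
                // U+FEFF is encoded in UTF-8 as the three bytes EF BB BF.
                osw.write("\uFEFF");
            }
            osw.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path withBom = Files.createTempFile("with-bom", ".txt");
        Path withoutBom = Files.createTempFile("without-bom", ".txt");
        write(withBom, "A", true);
        write(withoutBom, "A", false);

        System.out.println(Arrays.toString(Files.readAllBytes(withBom)));    // [-17, -69, -65, 65]
        System.out.println(Arrays.toString(Files.readAllBytes(withoutBom))); // [65]
    }
}
```

The first file starts with the same three bytes (239, 187, 191) that the question writes by hand.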

You also have to change the Notepad++ preferences as shown below. Only then will Notepad++ recognize the encoding you want.

[image]

However, if you simply write plain text, Notepad++ treats it as "ANSI".

I hope my explanation is clear and that my analysis helps someone. This approach is a workaround and is not recommended, but it works when nothing else does.

If you do not want to change your Notepad++ preferences and still want the encoding to be "Encode in UTF-8 without BOM", you must do something like this:

[image]

I have explained the same thing, probably in a better way, in my blog here.

Remembrance answered 8/4, 2014 at 4:17 Comment(1)
A better understanding here. Remembrance
