How to add a UTF-8 BOM in Java?
P

9

29

I have a Java stored procedure that fetches records from a table using a ResultSet object and creates a CSV file.

BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);

ZipOutputStream zipOut = new ZipOutputStream(bOut);
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
out.write('\ufeff');
out.flush();

zipOut.putNextEntry(new ZipEntry("filename.csv"));
while (rs.next()){
    out.print("\"" + rs.getString(i) + "\"");
    out.print(",");
}
out.flush();

zipOut.closeEntry();
zipOut.close();
retBLOB.close();

return retBLOB;

But the generated CSV file doesn't show German characters correctly. The Oracle database also has an NLS_CHARACTERSET value of UTF8.

Please suggest.

Purdum answered 8/12, 2010 at 15:10 Comment(8)
Just in case you haven't come across this before, note that the Unicode standard does not require or recommend using a BOM with UTF-8. It isn't illegal, either, but shouldn't be used indiscriminately. See here for the details, including some guidelines on when and where to use it. If you are trying to view the csv file in Windows, this is probably a valid use of the BOM.Tamis
Yes, we are trying to view the CSV in Windows, but the generated CSV still shows garbled text for German characters. Is this the right way to set the BOM?Purdum
Yes, that’s right. The Unicode standard recommends against using a so-called BOM (it isn’t really) with UTF-8.Catiline
@tchrist: it recommends against using a BOM when dealing with software and protocols that expect ASCII-only chars. If the OP knows that the Windows software he's using will use the BOM to detect that the file is actually encoded in UTF-8, then adding it is fine (we don't care about the fact that it ain't a BOM, we care about the fact that it allows some software to detect that the encoding is UTF-8). Also note that if you add a BOM to a UTF-8 file and some software fails, then that software is broken, because a BOM at the beginning of a UTF-8 file is perfectly valid.Castrate
Of course the real issue here is that CSV files have no metadata, and no specification mandates that the encoding of the file be specified. It's basically the same old SNAFU that also affects .java files and many other crappy, underspec'ed file formats.Castrate
@Webinator: I realize that this is at best a partial solution to the problem, but I would really like to see a standard per-source-unit annotation like @encoding UTF-8 in Java files. I understand that this only works for supersets of ASCII like UTF-8, ISO 8859-?, MacRoman, or CP1252, and that it has to occur before any non-ASCII characters are seen. But this is the same restriction as in-band encoding specs in XML, Perl, and Python. I’m told it wouldn’t be too hard to implement an annotator like that, but apart from regexes and encodings, my Java-fu is weak. Sure would be useful, eh?!Catiline
For the completeness of the BOM discussion. Excel 2003 strictly requires the BOM in UTF-8 encoded CSV files. Otherwise multibyte chars are unreadable.Gamble
I've recently been looking at the behaviour of Microsoft Excel 2016. If a .csv file is renamed to .txt, or if a new Excel spreadsheet has data added "From Text", then the data is loaded by the "Text Import Wizard". Apparently this is smart enough to recognise that it's receiving data with a "File origin" of code page "65001 (UTF-8)", and if it isn't, then you can tell it so. Then you have to tell it a few more things. I have written a little Cmd script to copy a BOM and then the data from one file into another file, to avoid that.Whalebone
C
12

To write a BOM in UTF-8 you need PrintStream.print(), not PrintStream.write().

Also if you want to have BOM in your csv file, I guess you need to print a BOM after putNextEntry().
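
A minimal sketch of that ordering, assuming the rest of the question's code stays as it is. The class name ZippedCsvWithBom, the zipCsv helper and the ByteArrayOutputStream (standing in for the BLOB's binary stream) are illustrative and not part of the original code. As far as I can tell, ZipOutputStream rejects writes made before an entry is opened, and PrintStream swallows that exception, so the sketch also checks checkError() before closing the entry.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZippedCsvWithBom {

    // Hypothetical helper: zips a single CSV entry that starts with a UTF-8 BOM.
    public static byte[] zipCsv(String entryName, List<String> values) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the BLOB stream
        ZipOutputStream zipOut = new ZipOutputStream(sink);
        PrintStream out = new PrintStream(zipOut, false, "UTF-8");

        // Open the entry first so that everything written next belongs to it.
        zipOut.putNextEntry(new ZipEntry(entryName));

        // print(char) encodes U+FEFF with the stream's charset: EF BB BF for UTF-8.
        out.print('\ufeff');
        for (String value : values) {
            out.print("\"" + value + "\"");
            out.print(",");
        }
        out.flush();

        // PrintStream hides IOExceptions, so surface its error flag explicitly.
        if (out.checkError()) {
            throw new IOException("PrintStream reported a write error");
        }

        zipOut.closeEntry();
        zipOut.close();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Sample values with German characters, written as Unicode escapes.
        byte[] zip = zipCsv("filename.csv", Arrays.asList("Gr\u00fc\u00dfe", "M\u00fcnchen"));
        System.out.println("zip size in bytes: " + zip.length);
    }
}
```

Unzipping the result and inspecting the first three bytes of filename.csv is a quick way to confirm the BOM ended up inside the entry.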

Comber answered 8/12, 2010 at 15:41 Comment(3)
Aren’t all PrintStreams fundamentally flawed because they discard all errors that may occur on the stream, including I/O errors, full filesystems, network interruptions, and encoding mismatches? If this is not true, could you please tell me how to make them reliable (because I want to use them)? But if it is true, could you please explain when it could ever be appropriate to use an output method that suppresses correctness concerns? This is a serious question, because I don’t understand why this was set up to be so dangerous. Thanks for any insights.Catiline
@Catiline - it is true that PrintStreams suppress errors. However ... 1) they are not entirely discarded - you can check to see if an error has occurred. 2) There are cases where you don't need to know about errors. An indisputable case is when you are sending characters to a stream that is writing to an in-memory buffer.Crossarm
@Catiline I guess, this is all caused by using checked exceptions. Normally, you'd just throw on any error and be happy. You could make an existing PrintStream "safe" by wrapping each call and adding checkError and conditionally throw. But the information about the exception is lost. So yes, PrintStream is a hopeless crap.Antipyrine
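
The wrapping idea from the comment above could look roughly like this. It is only a sketch: CheckedPrint and its print method are made-up names, and as the comment says, the original exception is lost, so the thrown IOException carries only a generic message.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;

public class CheckedPrint {

    // Hypothetical helper: prints, then turns PrintStream's error flag into an exception.
    public static void print(PrintStream out, String text) throws IOException {
        out.print(text);
        if (out.checkError()) { // checkError() also flushes the stream
            throw new IOException("PrintStream reported an error; the original cause is not available");
        }
    }

    public static void main(String[] args) {
        // An OutputStream that always fails, to simulate a broken destination.
        OutputStream failing = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("simulated failure");
            }
        };
        try {
            print(new PrintStream(failing), "hello");
            System.out.println("no error reported");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note that the error flag stays set once an error has occurred, so a helper like this reports every later call on the same stream as failed too.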
W
83
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(...), StandardCharsets.UTF_8));
out.write('\ufeff');
out.write(...);

This correctly writes out 0xEF 0xBB 0xBF to the file, which is the UTF-8 representation of the BOM.

Wamsley answered 14/11, 2011 at 11:18 Comment(1)
This code is sensitive to default platform encoding. On Windows, I ended up with 0x3F written to the file. The correct way to get the BufferedWriter is: BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(the File), StandardCharsets.UTF_8))Astro
B
16

Just in case people are using PrintStreams, you need to do it a little differently. While a Writer will encode the single character '\ufeff' into 3 bytes, PrintStream.write(int) writes only one byte per call, so all 3 bytes of the UTF-8 BOM have to be written individually:

    // Print utf-8 BOM
    PrintStream out = System.out;
    out.write('\ufeef'); // emits 0xef
    out.write('\ufebb'); // emits 0xbb
    out.write('\ufebf'); // emits 0xbf

Alternatively, you can use the hex values for those directly:

    PrintStream out = System.out;
    out.write(0xef); // emits 0xef
    out.write(0xbb); // emits 0xbb
    out.write(0xbf); // emits 0xbf
Bibliology answered 30/3, 2016 at 14:29 Comment(0)
C
10

PrintStream#print

I think that out.write('\ufeff'); should actually be out.print('\ufeff');, calling the java.io.PrintStream#print method.

According to the javadoc, the write(int) method actually writes a single byte ... without any character encoding. So out.write('\ufeff'); writes only the low-order byte, 0xff. By contrast, the print(char) method encodes the character as one or more bytes using the stream's encoding, and then writes those bytes.

As noted in section 23.8 of the Unicode 9 specification, the BOM for UTF-8 is EF BB BF. That sequence is what you get when using UTF-8 encoding on '\ufeff'. See: Why UTF-8 BOM bytes efbbbf can be replaced by \ufeff?.
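
To see the difference for yourself, here is a small sketch that captures the bytes each call produces. The class name BomWriteVsPrint and its helper methods are made up for illustration, and a ByteArrayOutputStream stands in for the real destination.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class BomWriteVsPrint {

    // Bytes produced by PrintStream.write(int) for U+FEFF.
    public static byte[] viaWrite() throws UnsupportedEncodingException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, false, "UTF-8");
        out.write('\ufeff'); // write(int): only the low-order byte (0xFF) is written
        out.flush();
        return buf.toByteArray();
    }

    // Bytes produced by PrintStream.print(char) for U+FEFF.
    public static byte[] viaPrint() throws UnsupportedEncodingException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, false, "UTF-8");
        out.print('\ufeff'); // print(char): the character is encoded with the stream's charset
        out.flush();
        return buf.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println("write: " + hex(viaWrite())); // prints "write: FF"
        System.out.println("print: " + hex(viaPrint())); // prints "print: EF BB BF"
    }
}
```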

Crossarm answered 8/12, 2010 at 15:42 Comment(2)
Isn’t the only safe way to do encoded output in Java to use the rarely-seen OutputStreamWriter(OutputStream out, CharsetEncoder enc) form of the constructor, the only one of the four with an explicit CharsetEncoder argument, and never to use the PrintStream that you’ve recommended here?Catiline
@Catiline - 1) No. 2) I didn't recommend PrintStream. I simply said how to do what the OP asked to do using the PrintStream he was already using. 3) In this case PrintStream should be safe because it is followed by other actions that will cause writes to the underlying stream (socket) and throw an exception if the previous PrintStream writes had silently failed.Crossarm
S
8

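Add this at the start of the CSV string:

String CSV = "";
byte[] BOM = {(byte) 0xEF,(byte) 0xBB,(byte) 0xBF};
// Decode explicitly as UTF-8 (requires java.nio.charset.StandardCharsets);
// new String(BOM) would use the platform default charset.
CSV = new String(BOM, StandardCharsets.UTF_8) + CSV;

This works for me.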

Skeie answered 15/7, 2020 at 15:48 Comment(0)
Q
1

If you just want to modify the same file (without creating a new file and deleting the old one, as I had issues with that):

private void addBOM(File fileInput) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(fileInput, "rws")) {
        byte[] text = new byte[(int) file.length()];
        file.readFully(text);
        file.seek(0);
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        file.write(bom);
        file.write(text);
    }
}
Quickstep answered 24/6, 2021 at 14:3 Comment(0)
H
1

Using StringBuilder

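StringBuilder csv = new StringBuilder();
csv.append('\ufeff'); // U+FEFF becomes EF BB BF once the string is encoded as UTF-8
csv.append(content);
String csvWithBom = csv.toString(); // csvWithBom is an illustrative name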
Huda answered 17/5, 2023 at 11:38 Comment(0)
Q
0

In my case it works with the code:

PrintWriter out = new PrintWriter(new File(filePath), "UTF-8");
out.write(csvContent);
out.flush();
out.close();
Quadrilateral answered 19/12, 2013 at 9:1 Comment(0)
M
0

Here is a simple way to prepend a BOM to any file:

private static void appendBOM(File file) throws Exception {
    File bomFile = new File(file + ".bom");
    try (FileOutputStream output = new FileOutputStream(bomFile, true)) {
        byte[] bytes = FileUtils.readFileToByteArray(file);
        output.write('\ufeef'); // emits 0xef
        output.write('\ufebb'); // emits 0xbb
        output.write('\ufebf'); // emits 0xbf
        output.write(bytes);
        output.flush();
    }
    
    file.delete();
    bomFile.renameTo(file);
}
Margarettamargarette answered 22/12, 2020 at 15:24 Comment(0)
