How to Remove BOM from an XML file in Java

I need suggestions on how to remove the BOM from a UTF-8 file and create a copy of the rest of the XML file.

Unifoliolate answered 16/3, 2012 at 12:11 Comment(8)
the downvotes aren't because of a duplicate, they're because this question is too broad in nature - stackoverflow is for helping with specific, localised programming issues. We can help you debug a program, we won't write one for you. - Statolatry
I'm waiting for this to be at -5 before answering ; ) - Lapides
@hari: what is the encoding of your file? UTF-8 ? - Lapides
I am not asking people to write code for me... I am asking for suggestions on the way to do it. And i want to know if it is possible. People do blindly vote. - Unifoliolate
@Lapides Ya. Its Utf-8 with Bom. If its Utf-8 without Bom the parser i use work. - Unifoliolate
@Lapides i hope u r not serious abt the -5 voting :D - Unifoliolate
@hari: do not worry, I'll answer you anyway... - Lapides
@Unifoliolate en.wikipedia.org/wiki/Byte_order_mark - Maurice

Having a tool break because of a BOM in a UTF-8 file is a very common thing in my experience. I don't know why there were so many downvotes (but then it gives me the chance to try to get enough votes to win a special SO badge ; )

More seriously: a UTF-8 BOM doesn't typically make that much sense, but it is fully valid (although discouraged) according to the specs. The problem is that a lot of people aren't aware that a BOM is valid in UTF-8 and hence wrote broken tools / APIs that do not process these files correctly.

Now you may have two different issues: you may want to process the file from Java, or you may need to use Java to programmatically create/fix files that other (broken) tools need.

I've had the case in one consulting gig where the helpdesk would keep getting messages from users that had problems with some text editor that would mess up perfectly valid UTF-8 files produced by Java. So I had to work around that issue by making sure to remove the BOM from every single UTF-8 file we were dealing with.

If you want to delete a BOM from a file, you could create a new file and skip the first three bytes. For example:

... $  file  /tmp/src.txt 
/tmp/src.txt: UTF-8 Unicode (with BOM) English text

... $  ls -l  /tmp/src.txt 
-rw-rw-r-- 1 tact tact 1733 2012-03-16 14:29 /tmp/src.txt

... $  hexdump  -C  /tmp/src.txt | head -n 1
00000000  ef bb bf 50 6f 6b 65 ...

As you can see, the file starts with "ef bb bf": this is the (fully valid) UTF-8 BOM.

Here's a method that takes a file and makes a copy of it by skipping the first three bytes:

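import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public static void workAroundbrokenToolsAndAPIs(File sourceFile, File destFile) throws IOException {
    // try-with-resources closes both streams (and their channels), even if the copy fails
    try (FileInputStream in = new FileInputStream(sourceFile);
         FileOutputStream out = new FileOutputStream(destFile)) {
        FileChannel source = in.getChannel();
        FileChannel destination = out.getChannel();

        // Start right after the three BOM bytes (EF BB BF); no check is done here
        long position = 3;
        long size = source.size();
        // transferTo may copy fewer bytes than requested, so loop until everything is copied
        while (position < size) {
            position += source.transferTo(position, size - position, destination);
        }
    }
}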

Note that it's "raw": you'd typically want to first make sure you have a BOM before calling this, or "Bad Things May Happen" [TM].
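One way to make that check could look like the sketch below. The class name BomCheck and the method hasUtf8Bom are made up for illustration, and the code assumes Java 9 or later (for InputStream.readNBytes). It only looks at the first three bytes and does not modify the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of any library
final class BomCheck {
    // The three bytes of the UTF-8 BOM, as seen in the hexdump above
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true only if the file starts with EF BB BF. */
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            // readNBytes (Java 9+) keeps reading until 3 bytes arrive or the file ends
            int count = in.readNBytes(head, 0, head.length);
            return count == 3 && Arrays.equals(head, UTF8_BOM);
        }
    }
}
```

You would then guard the copy with something like: if (BomCheck.hasUtf8Bom(sourceFile.toPath())) { workAroundbrokenToolsAndAPIs(sourceFile, destFile); }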

You can look at your file afterwards:

... $  file  /tmp/dst.txt 
/tmp/dst.txt: UTF-8 Unicode English text

... $  ls -l  /tmp/dst.txt 
-rw-rw-r-- 1 tact tact 1730 2012-03-16 14:41 /tmp/dst.txt

... $  hexdump -C /tmp/dst.txt
00000000  50 6f 6b 65 ...

And the BOM is gone...

Now if you simply want to transparently remove the BOM for one of your broken Java APIs, then you could use the PushbackInputStream described here: why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?

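// Needs java.io.BufferedInputStream, IOException, InputStream and PushbackInputStream
private static InputStream checkForUtf8BOMAndDiscardIfAny(InputStream inputStream) throws IOException {
    PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
    byte[] bom = new byte[3];
    // read() may return fewer than 3 bytes, so keep reading until the buffer is full or the stream ends
    int count = 0;
    while (count < bom.length) {
        int n = pushbackInputStream.read(bom, count, bom.length - count);
        if (n == -1) break;
        count += n;
    }
    boolean hasBom = count == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
    if (!hasBom && count > 0) {
        // Not a BOM: push back only the bytes that were actually read
        pushbackInputStream.unread(bom, 0, count);
    }
    return pushbackInputStream;
}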

Note that this works, but it will definitely NOT fix the more serious issue where other tools in the work chain do not work correctly with UTF-8 files that have a BOM.
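Since the question is about XML, here is a rough sketch of how the stream wrapper might be plugged in front of a parser. The class BomFreeXml and its method names are invented for this example, and it uses the JDK's standard DocumentBuilderFactory only as a stand-in: your own (BOM-intolerant) parser would take its place.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, shown only to illustrate the wiring
final class BomFreeXml {
    /** Same idea as the method above: drop a leading EF BB BF, leave anything else untouched. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int count = 0;
        // Fill the 3-byte buffer, tolerating short reads and short streams
        while (count < head.length) {
            int n = pushback.read(head, count, head.length - count);
            if (n == -1) break;
            count += n;
        }
        boolean hasBom = count == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (!hasBom && count > 0) {
            // Not a BOM: give the bytes back so the parser sees the whole document
            pushback.unread(head, 0, count);
        }
        return pushback;
    }

    /** Parses XML from a stream that may or may not start with a UTF-8 BOM. */
    static Document parse(InputStream rawXml) throws Exception {
        try (InputStream in = skipUtf8Bom(rawXml)) {
            // Assumption: the JDK parser stands in for whatever parser you actually use
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }
}
```

The file on disk is left as it is; only the bytes handed to the parser change, which is why this does not help other tools in the chain.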

And here's a link to a question with a more complete answer, covering other encodings as well:

Byte order mark screws up file reading in Java
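As a rough idea of what "other encodings" involves, the sketch below recognises the BOMs of UTF-8, UTF-16 and UTF-32 from the first bytes of a file. It is not taken from the linked answer; the class BomSniffer and its methods are made-up names, and the caller is assumed to pass in at least the first four bytes when available.

```java
// Hypothetical helper, not taken from the linked answer
final class BomSniffer {
    /** Number of leading bytes that form a known Unicode BOM, or 0 if there is none. */
    static int bomLength(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return 3;        // UTF-8
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return 4;  // UTF-32 big-endian
        // Checked before UTF-16LE because it shares the FF FE prefix; note that
        // FF FE 00 00 could also be UTF-16LE text starting with a NUL character
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return 4;  // UTF-32 little-endian
        if (startsWith(head, 0xFE, 0xFF)) return 2;              // UTF-16 big-endian
        if (startsWith(head, 0xFF, 0xFE)) return 2;              // UTF-16 little-endian
        return 0;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare bytes as unsigned values
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The returned length is the number of bytes to skip, so the three-byte copy shown earlier is just the UTF-8 case of this.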

Lapides answered 16/3, 2012 at 12:48 Comment(7)
Votes aren't a judgement on the subject of a question, they're a judgement on the quality of a question. Look at the tooltips for the voting buttons. - Chur
@skaffman: OK but instead of downvoting I asked if OP was using a UTF-8 file (which I suspected for that issue is all too common) and then added that to the question (and edited the tags). I don't know what else can be said: "How to remove a BOM from a file?" is pretty self-explanatory. I added "UTF-8". Of course it would have been easier for me to simply downvote ; ) - Lapides
@Lapides thanks a lot for ur suggestions.. I am sure that this would solve the pblm i had. - Unifoliolate
@Lapides Tht solved the issue i had ! May i also know if there will be any problem if i remove the Bom from a file...Does anythng else depend on bom? - Unifoliolate
@hari: well in UTF-8 the specs state that it's best to not use any, but that the BOM is allowed. So normally the various tools should work if there's no BOM. Now I wouldn't be surprised if there were a few tools out there that would rely on one and that could hence be described as "being broke the other way"... :( - Lapides
hmm ..to be on the safer side i ll add it back if the file had any after parsing is done... thanks a lot again ! - Unifoliolate
Thanks for that little function snippet at the end. Not much of a java guy, didn't know about pushback. Much less hassle than importing apache commons. In general, if you write a long question then it won't get downvoted. If he had talked all about the program, its inputs, blah blah, it would have been fine. Semantically the same, but accepted. :) - Haft
