Does Java have methods to get the various byte order marks?

5

4

I am looking for a utility method or constant in Java that returns the bytes of the byte order mark for a given encoding, but I can't seem to find one. Is there one? I would really like to do something like:

byte[] bom = Charset.forName( CharEncoding.UTF8 ).getByteOrderMark();

Where CharEncoding comes from Apache Commons.

Engel answered 2/4, 2009 at 23:20 Comment(1)

have a look at #1835930 – Faitour 23/2, 2010 at 16:59

4

Java does not recognize byte order marks for UTF-8. See bugs 4508058 and 6378911.

The gist is that support was added, broke backwards compatibility, and was rolled back. You'll have to do BOM recognition in UTF-8 yourself.
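As a rough illustration of doing the recognition yourself, here is a minimal sketch that peeks at the first three bytes of a stream and drops them only if they are the UTF-8 BOM (EF BB BF). The class and method names (Utf8BomSkipper, skipBom) are made up for this example and are not part of the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; nothing like this ships with the JDK.
final class Utf8BomSkipper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = 0;
        // read() may deliver fewer bytes than requested, so loop until full or EOF.
        while (read < head.length) {
            int n = pushback.read(head, read, head.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean hasBom = read == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!hasBom && read > 0) {
            // Not a BOM: push the bytes back so the caller sees them again.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}
```

The returned stream can then be wrapped in an InputStreamReader with StandardCharsets.UTF_8 as usual.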

Wingfooted answered 21/4, 2009 at 20:3 Comment(0)

3

Apache Commons IO contains what you are looking for, see org.apache.commons.io.ByteOrderMark.

Touber answered 13/9, 2012 at 15:41 Comment(0)

2

You can generate the BOM like this:

byte[] utf8_bom = "\uFEFF".getBytes("UTF-8");
byte[] utf16le_bom = "\uFEFF".getBytes("UnicodeLittleUnmarked");

If you wish to create the BOMs for other encodings using this method, make sure you use the version of the encoding that does not automatically insert the BOM or it will be repeated. This technique only applies to Unicode encodings and will not produce meaningful results for others (like Windows-1252).
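On Java 7 and later the same technique can be written with StandardCharsets, which avoids the checked UnsupportedEncodingException. This is only a sketch; the holder class BomBytes and its field names are invented for the example. The last constant shows the repetition warned about above: the plain UTF-16 charset writes its own BOM when encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder class for illustration only.
final class BomBytes {
    // Encoding U+FEFF with a charset that does not add its own BOM yields the BOM bytes.
    static final byte[] UTF8 = "\uFEFF".getBytes(StandardCharsets.UTF_8);       // EF BB BF
    static final byte[] UTF16LE = "\uFEFF".getBytes(StandardCharsets.UTF_16LE); // FF FE
    static final byte[] UTF16BE = "\uFEFF".getBytes(StandardCharsets.UTF_16BE); // FE FF

    // StandardCharsets.UTF_16 inserts a BOM by itself, so the mark appears twice here.
    static final byte[] UTF16_DOUBLED = "\uFEFF".getBytes(StandardCharsets.UTF_16);
}
```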

Oxley answered 3/4, 2009 at 9:42 Comment(4)

My specific case is writing a CSV file that is UTF-8. As far as I can tell, the UTF-8 BOM is the only way to convince Excel to not attempt to read the file in the default character encoding. – Engel 3/4, 2009 at 13:27

There isn't a util method that will help you with your Excel file, but writing 0xEF 0xBB 0xBF to your OutputStream shouldn't be a problem. Just write those bytes before you wrap your stream in a UTF-8 encoded Writer. – Oxley 4/4, 2009 at 23:41

I wouldn't say that "its usage is discouraged" by the FAQ. It is true that the UTF-8 BOM doesn't specify a "byte-order mark" (making it something of a misnomer), but it definitely helps in signifying that the stream uses UTF-8 encoding. – Wait 21/4, 2009 at 17:30

That's a fair comment - I've updated the post. I can't help feeling they used to favour not using it, though: blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx – Oxley 21/4, 2009 at 19:54
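For the CSV case discussed in these comments, writing the BOM before wrapping the stream might look like the sketch below. The class name BomCsvWriter and the method open are assumptions made for the example; whether Excel then picks the right encoding is the asker's observation and is not checked here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class BomCsvWriter {
    /** Writes the UTF-8 BOM (EF BB BF), then returns a UTF-8 Writer over the same stream. */
    static Writer open(OutputStream out) throws IOException {
        // The raw bytes go out first, before any character encoding is involved.
        out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }
}
```

A caller would pass in something like a FileOutputStream, write the CSV rows through the returned Writer, and close it when done.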

1

There isn't anything in the JDK as far as I can see, nor any of the Apache projects.

Eclipse EMF, however, has an enum that provides support:

org.eclipse.emf.ecore.resource.ContentHandler.ByteOrderMark

I'm not sure whether that's of any help to you?

There's some more info here on the BOMs for each encoding type. You could write a simple helper class or enum for this:

http://mindprod.com/jgloss/bom.html
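Such a helper enum could be sketched as follows. The name Bom and its layout are purely illustrative; the byte values are the standard BOMs for the five Unicode encoding schemes.

```java
// Hypothetical helper enum for illustration only; not a JDK or Commons API.
enum Bom {
    UTF_8(0xEF, 0xBB, 0xBF),
    UTF_16BE(0xFE, 0xFF),
    UTF_16LE(0xFF, 0xFE),
    UTF_32BE(0x00, 0x00, 0xFE, 0xFF),
    UTF_32LE(0xFF, 0xFE, 0x00, 0x00);

    private final byte[] bytes;

    Bom(int... values) {
        bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
    }

    /** Returns a defensive copy so callers cannot modify the constant. */
    byte[] bytes() {
        return bytes.clone();
    }
}
```

With this in place, Bom.UTF_8.bytes() gives the three bytes to write at the start of a UTF-8 file.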

Hope that helps. I'm surprised this isn't in Commons I/O to be honest.

Squirt answered 3/4, 2009 at 0:3 Comment(1)

It's in there now: commons.apache.org/io/apidocs/org/apache/commons/io/input/… – Ferrocene 12/9, 2011 at 15:27

1

It is worth noting that many encodings don't use a byte order mark at all. For example, an empty string in UTF-8 is just an empty byte[]. While there is a BOM specified for UTF-8, it is rarely used in Java and is not always supported.

Paucity answered 3/4, 2009 at 6:23 Comment(3)

Downvoted, because this seems to be incorrect as written. A three-byte sequence of bytes containing the UTF-8 BOM (EFBBBF) will be interpreted as an empty UTF-8 string if the application understands how to handle BOMs. (And if it doesn't, the BOM is going to cause trouble, empty string or no.) – Wait 21/4, 2009 at 17:27

Java doesn't understand BOMs for UTF-8. I've seen people get bit by this (text editor decided to add a BOM, javac puked.) – Wingfooted 21/4, 2009 at 20:15

Peter's answer is incorrect, see en.wikipedia.org/wiki/… – Lawtun 19/2, 2013 at 20:55