How do I encode/decode UTF-16LE byte arrays with a BOM?

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Marker (BOM), and the byte arrays I produce need to include a BOM as well.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit little-endian output (along with the LE BOM) to avoid any misunderstandings. I realize that, with a BOM present, big-endian output should also work, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch (UnsupportedEncodingException e) {
        // should not be possible: support for UTF-16LE is required
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
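
For example, a decode counterpart to the method above might look like this (a minimal sketch; decodeString is just an illustrative name):

public static String decodeString(byte[] message) {
    try {
        // "UTF-16" consumes the BOM and picks the byte order it announces
        return new String(message, 0, message.length, "UTF-16");
    } catch (UnsupportedEncodingException e) {
        // should not be possible: support for UTF-16 is required
        AssertionError ae =
            new AssertionError("Could not decode UTF-16");
        ae.initCause(e);
        throw ae;
    }
}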
Beaconsfield answered 18/5, 2009 at 19:55 Comment(0)

The "UTF-16" charset name will always encode with a BOM and will decode data using either big/little endianness, but "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings are absolutely required to be supported.

Roselleroselyn answered 18/5, 2009 at 20:8 Comment(6)
Thanks! One more issue though... Using "UTF-16" encodes the data as big-endian, which I suspect will not go over well with Microsoft software (even though the BOM is there). Any way to encode UTF-16LE with a BOM in Java? I'll update my question to reflect what I was really looking for...Beaconsfield
Click on the "see this post" link he gave. Basically, you stuff a \uFEFF character at the beginning of your string, and then encode to UTF-16LE, and the result will have a proper BOM.Aphrodisiac
Use "UnicodeLittle" (assuming your JRE supports it - ("\uEFFF" + "my string").getBytes("UTF-16LE") otherwise). Though I would be surprised if Microsoft APIs expected a BOM but couldn't handle big-endian data - they tend to like using BOMs more than other platforms. Test with empty strings - you may get empty arrays if there is no data.Roselleroselyn
I would be completely unsurprised at Microsoft defining a format that expects a UTF-16LE BOM at the start of a file and misbehaves if the file begins with a UTF-8 BOM or a UTF-16BE BOM. I say that because this is exactly the behavior I have observed with Excel loading CSV files - if the file begins with a UTF-16LE BOM, it loads the data as UTF-16LE and expects tabs between columns. With any other leading byte sequence it loads the data in some local character set with "," or ";" (locale-dependent!) between columns.Aphrodisiac
Thanks for the Excel anecdote, @Daniel Martin. Exactly the kind of behavior I don't want to discover. :)Beaconsfield
Just to reiterate: "UnicodeLittle" (a.k.a. "x-UTF-16LE-BOM") will write the file as UTF-16 little-endian with a BOM. This should be the preferred method for WRITING the files, but it only seems to be available since Java 6 (JDK 1.6). For READING, you should stick with "UTF-16".Hatley

First off, for decoding you can use the "UTF-16" character set; it automatically detects an initial BOM. For encoding as UTF-16BE you can also use the "UTF-16" character set - it will write a proper BOM and then output big-endian data.

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).
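
A minimal sketch of that CharsetEncoder/ByteBuffer approach (illustrative only - the variable names and the up-front sizing are assumptions, and error/overflow handling is omitted):

    String message = "hello";                 // the string to encode
    CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
    CharBuffer chars = CharBuffer.wrap(message);
    // one allocation: 2 bytes for the BOM plus the encoder's worst case for the chars
    ByteBuffer out = ByteBuffer.allocate(2 + (int) (chars.remaining() * encoder.maxBytesPerChar()));
    out.put((byte) 0xFF).put((byte) 0xFE);    // little-endian BOM
    encoder.encode(chars, out, true);
    encoder.flush(out);
    out.flip();    // out now holds BOM + UTF-16LE payload between position and limit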

Aphrodisiac answered 18/5, 2009 at 20:15 Comment(0)

This is how you do it in nio:

    return Charset.forName("UTF-16LE").encode(message)
            .put(0, (byte) 0xFF)
            .put(1, (byte) 0xFE)
            .array();

It is certainly supposed to be faster; I don't know how many arrays it creates under the covers, but my understanding is that the point of the API is to minimize exactly that.

Outgeneral answered 18/5, 2009 at 23:9 Comment(1)
This one actually doesn't work. The put(0, ...) and put(1, ...) calls overwrite the first two bytes of the encoded message's ByteBuffer instead of prepending a BOM.Chert
    // allocate roughly enough room up front: 2 bytes per char plus 2 for the BOM
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    // write the little-endian BOM, then the payload encoded without a BOM
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify encoding with it).

I wrote the above before I saw your comment; I think the answer suggesting the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how to get it done.

Outgeneral answered 18/5, 2009 at 20:9 Comment(3)
Thanks. In addition, what I would have liked here is to avoid allocating the entire byte array via string.getBytes("UTF-16LE") - perhaps by wrapping the stream as an InputStream, which was the point of my earlier question: #838203Beaconsfield
Note that this code actually allocates arrays big enough for the String three times, since you have the internal array of the ByteArrayOutputStream, which is then copied by the call to .toByteArray(). A way to get it back down to only allocating two is to wrap the ByteArrayOutputStream in an OutputStreamWriter and write the string to that (see the sketch after these comments). Then you still have the ByteArrayOutputStream's internal state and the copy made by .toByteArray(), but not the return value from .getBytes.Aphrodisiac
It seems that you are just exchanging a char array for a byte array if you do that, as the OutputStreamWriter delegates to the StreamEncoder class, which creates a char[] buffer to retrieve the String data. String is immutable, and the size of an array is fixed, so that copy seems unavoidable. I think nio is supposed to help with that double creation on the ByteArrayOutputStream.Outgeneral
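
A sketch of the OutputStreamWriter variant suggested in the comments above (illustrative; exception handling omitted, and as noted, the underlying StreamEncoder still copies the chars internally):

    ByteArrayOutputStream out = new ByteArrayOutputStream(string.length() * 2 + 2);
    // write the little-endian BOM directly...
    out.write(0xFF);
    out.write(0xFE);
    // ...then stream the chars through a UTF-16LE writer instead of calling getBytes
    Writer writer = new OutputStreamWriter(out, "UTF-16LE");
    writer.write(string);
    writer.flush();
    return out.toByteArray();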

This is an old question, but I still couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM, so you have to roll your own implementation.

Here's what I ended up with:

private byte[] encodeUTF16LEWithBOM(final String s) {
    // encode the payload without a BOM, then prepend the BOM ourselves
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // size by remaining() rather than capacity() so no trailing garbage bytes are copied
    return ByteBuffer.allocate(content.remaining() + bom.length).put(bom).put(content).array();
}
Chert answered 24/8, 2017 at 22:17 Comment(0)
