Unicode BOM for UTF-16LE vs UTF-32LE

It seems like there's an ambiguity between the byte order marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:

FF FE 00 00 00 00 00 00

How can I tell if this file contains:

  1. The UTF-16LE BOM (FF FE) followed by three null characters; or
  2. The UTF-32LE BOM (FF FE 00 00) followed by one null character?

Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?
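For illustration, here is the ambiguity in Python (a minimal sketch; the codec names are the standard library's):

    data = bytes([0xFF, 0xFE, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])

    # Reading 1: UTF-16LE BOM (FF FE) + three U+0000 characters
    print(repr(data.decode("utf-16")))  # '\x00\x00\x00' (BOM consumed)

    # Reading 2: UTF-32LE BOM (FF FE 00 00) + one U+0000 character
    print(repr(data.decode("utf-32")))  # '\x00' (BOM consumed)

Both decodes succeed without error, which is exactly the ambiguity.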

Cushion answered 18/12, 2009 at 18:36 Comment(0)

As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first; then you can use the BOM to determine whether the least or the most significant byte comes first in multibyte sequences.

A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.
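As a minimal Python illustration of this division of labour (assuming the standard library codecs): the generic utf-16 codec presumes you already know the data is UTF-16, and uses the BOM only to pick the byte order.

    # Same character in both byte orders; the BOM tells the decoder which is which.
    print((b"\xff\xfe" + "\u20ac".encode("utf-16-le")).decode("utf-16"))  # €
    print((b"\xfe\xff" + "\u20ac".encode("utf-16-be")).decode("utf-16"))  # €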

Peacoat answered 18/12, 2009 at 18:46 Comment(0)

It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^
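Under that reading, BOM sniffing is a longest-match-first check. A minimal Python sketch (the codecs.BOM_* constants are in the standard library; sniff_bom is a made-up name):

    import codecs

    def sniff_bom(data):
        # Try the longer BOMs before their prefixes, so that FF FE 00 00
        # resolves to UTF-32LE rather than UTF-16LE plus U+0000.
        boms = [
            (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
            (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
            (codecs.BOM_UTF8,     "utf-8"),      # EF BB BF
            (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
            (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
        ]
        for bom, name in boms:
            if data.startswith(bom):
                return name
        return None

    print(sniff_bom(b"\xff\xfe\x00\x00\x00\x00\x00\x00"))  # utf-32-le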

Mete answered 18/12, 2009 at 18:51 Comment(3)
The null character may well be part of a higher-order protocol encoded in the text. Unicode doesn't actually care about what code points are used in text and U+0000 is just as valid as U+0041.Jewel
The higher-order-protocol theory conflicts with the premise of the question, where the encoding has to be guessed. If you're reading a protocol, you don't guess the encoding.Denumerable
To put it another way, it's not impossible to have a U+0000 at the beginning of a file, but it's extremely rare. If this is a possibility for the data you're reading then you should not rely on a BOM for format detection.Distressed

I have experienced the same problem as Edward. I agree with Dustin; usually one will not use null characters in text files.

However, I created a file that contains all Unicode characters, encoded first as UTF-32LE, then as UTF-32BE, UTF-16LE, UTF-16BE, and UTF-8.

When trying to re-encode the files to UTF-8, I wanted to compare the result to the already existing UTF-8 file. Because the first character in my files after the BOM is the null character, the file with the UTF-16LE BOM could not be detected correctly: it showed up as UTF-32LE, because the bytes appeared exactly as Edward described. The first character after the BOM FF FE is 00 00, but the BOM detection found the BOM FF FE 00 00 and therefore detected UTF-32LE instead of UTF-16LE, so my first U+0000 character was swallowed and treated as part of the BOM.

So one should never use a null character as the first character of a UTF-16LE-encoded file, because it makes the UTF-16LE and UTF-32LE BOMs ambiguous.

To solve my problem, I will swap the first and second characters. :-)

Module answered 25/7, 2012 at 9:46 Comment(2)
If you rely on a BOM alone for detecting the encoding, then you need to look at more bytes than just the BOM to resolve the UTF-16/32 ambiguity. Check for the UTF-16LE BOM first, and if it is found, check whether the subsequent N*2 bytes are valid UTF-16LE, where N is a reasonable number; if they are not valid UTF-16LE, start over and assume UTF-32LE (see the sketch after these comments). U+0000 should be the only ambiguous code point, and there should not be many nulls at the start of the file. At some point there has to be a cutoff, and if you still cannot resolve the ambiguity by then, prompt the user, or fail the processing with an error.Yellowtail
Which means: if one detects a UTF-32LE BOM, one should first check whether it is really a UTF-16LE BOM followed by a U+0000 code point. If there are a lot of words, this might help, possibly also by detecting surrogates; but if there are only a few words, it can be hard. I agree, though, that when checking for valid UTF-32 code points you may find code points beyond the 0x10FFFF maximum if it is really a UTF-16-encoded file. In any case, we should recommend always placing a code point other than U+0000 as the first code point in a UTF-16LE-encoded file.Module
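A minimal Python sketch of the heuristic described in the first comment above (probe_units, the N in that comment, is a made-up cutoff, and the function name is hypothetical):

    def resolve_fffe(data, probe_units=64):
        # Called when data starts with FF FE: decide UTF-16LE vs UTF-32LE.
        assert data.startswith(b"\xff\xfe")
        if data[2:4] != b"\x00\x00":
            return "utf-16-le"  # the 4-byte UTF-32LE BOM cannot match here
        window = data[2:2 + probe_units * 2]
        try:
            # Strict UTF-16LE decoding fails on unpaired surrogates; a real
            # implementation would tolerate a pair split at the window edge.
            text = window.decode("utf-16-le")
        except UnicodeDecodeError:
            return "utf-32-le"  # not valid UTF-16LE: start over as UTF-32LE
        if not text.strip("\x00"):
            return None  # only U+0000 so far: prompt the user or fail
        return "utf-16-le"

Note the limitation: a UTF-32LE body often also decodes as valid UTF-16LE (for example, 41 00 00 00 reads as U+0041 U+0000), so the probe mainly catches unpaired surrogates, which is why the comment ends with a cutoff and a user prompt.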
