Is the byte order marker really a valid identifier?

C++11 makes numerous additions to the list of Unicode code points allowed in identifiers (§E). This includes the byte order mark, U+FEFF, which falls within the allowed range FE47-FFFD.

Consulting a character browser, this range covers a whole grab bag of random stuff: it begins at PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET (just past WHITE SESAME DOT), takes in some "small" punctuation forms, fancy Arabic presentation forms, the BOM itself, the halfwidth and fullwidth Asian characters, and finally ends with REPLACEMENT CHARACTER, which is usually used to indicate broken text rendering.

Surely this is some kind of error. They felt the need to exclude "sesame dots," whatever those are, but the byte order mark, a.k.a. the deprecated ZERO WIDTH NO-BREAK SPACE, is fair game? When there's another zero-width non-breaking space, a.k.a. WORD JOINER, which C++11 also made acceptable in identifiers?

The most elegant interpretation of the Standard, for an implementation that defines some form of Unicode as its source character set, seems to be to begin the file after an optional BOM. But it's also possible for a user to legitimately begin the file with a BOM that is part of an identifier. It's just ugly.

Am I missing something, or is this a no-brainer defect?

Donation answered 22/11, 2011 at 13:31 Comment(3)
Don't tell me you're a sesame dot hater!Quemoy
I love me some sesame, and I would be glad to pepper my programs with it. Maybe that's why it's a specific exclusion… such vegetables as ☙ are also forbidden. It's not healthy, I tells ya.Donation
+1 seems like a dodgy design decision to me. Ignoring the issue of whether or not an identifier-character-BOM is stripped from the beginning of a file, I'd love to see if there's any deliberate rationale for allowing BOM and REPLACEMENT at all... they don't seem to provide anything useful, only potential traps.Kiwanis

My attempt at an interpretation: The standard only lays out the rules for an abstract piece of source code.

Your compiler comes with a notion of a "source character set", which tells it how a concrete source code file is encoded. If that encoding is "UTF-16" (i.e. without the BE/LE specifier, and thus requiring a BOM), then the BOM is not part of the codepoint stream, but just of the file envelope.

Only after the file has been decoded does the codepoint stream get passed on to the compiler proper.

Lesser answered 22/11, 2011 at 15:26 Comment(3)
Yeah, that's what I meant by "begin the file after an optional BOM." In the case of UTF-8, a BOM may also be used, but since UTF-8 may be the default anyway, it's essentially optional. And the user might have typed it literally as the first thing in the file.Donation
However, a BOM can appear anywhere in a Unicode document, not just in UTF-16 and UTF-32, and not just as the first character.Flickinger
@DietrichEpp: That's not a contradiction. You can certainly have a BOM in the codepoint stream. I'm just saying that if "UTF-16" is your file encoding, the initial two bytes of the file do not form part of the encoded content. Potato: The UTF-8 envelope BOM is a different, three-byte sequence. Otherwise U+FEFF is just a regular codepoint.Lesser

First I want to say that the problem you're describing is unlikely to matter. If your compiler requires a UTF-8 BOM in order to treat a file as using the UTF-8 encoding, then you cannot have a file that lacks the UTF-8 BOM but where the source begins with U+FEFF in UTF-8 encoding. If your compiler does not require the UTF-8 BOM in order to process UTF-8 files, then you should not put UTF-8 BOMs in your source files (In the words of Michael Kaplan, "STOP USING WINDOWS NOTEPAD").

But yes, if the compiler strips BOMs then you can get behavior different from what was intended. If you want (unwisely) to begin a source file with U+FEFF but (wisely) refuse to put BOMs in your source files, you can use the universal character name instead: \uFEFF.

Now onto my answer.

The retrieval of physical source file characters is not defined by the C++ standard. Declaring the source file encoding to the compiler, the file formats for storing physical source characters, and the mapping of physical source file characters to the basic source character set are all implementation-defined. Support for treating U+FEFF at the beginning of a source file as an encoding hint lies in this area.

If a compiler supports an optional UTF-8 BOM and cannot distinguish a file where the optional BOM is supplied from one where it is not supplied but the source code begins with U+FEFF, then this is a defect in the compiler's design, and more broadly in the idea of the UTF-8 BOM itself.

In order to interpret bytes of data as text, the text encoding must be known, determined unambiguously by an authoritative source. (Here's an article that makes this point.) Unfortunately, before this principle was understood, data was already being transmitted between systems, and people had to deal with data that was ostensibly text but whose encoding wasn't necessarily known. So they came up with a very bad solution: guessing. The set of techniques involving the UTF-8 BOM is one of the methods of guessing that was developed.

The UTF-8 BOM was chosen as an encoding hint for a few reasons. First, it has no effect on visible text and so can be deliberately inserted without changing what is displayed. Second, non-UTF-8 files are very unlikely to begin with bytes that will be mistaken for the UTF-8 BOM. Neither of these, however, makes the BOM anything more than a guess. There's nothing that says an ISO-8859-1 plain text file can't start with the characters U+00EF U+00BB U+00BF, for example; that sequence encoded in ISO-8859-1 produces exactly the same bytes as U+FEFF encoded in UTF-8: 0xEF 0xBB 0xBF. Any software that relies on detecting a UTF-8 BOM will be confused by such an ISO-8859-1 file. So a BOM can't be an authoritative source, even though guessing based on it will almost always work.

Aside from the fact that using the UTF-8 BOM amounts to guessing, there's a second reason it's a terrible idea: the mistaken assumption that changes to text which have no effect on its visual display have no effect at all. This assumption fails whenever text is used for something other than visual display, such as when it's meant to be read by a computer, as source code is.

So in conclusion: This problem with the UTF-8 BOM is not caused by the C++ specification; and unless you're absolutely forced to interact with brain-dead programs that require it (in other words, programs that can only handle the subset of Unicode strings which begin with U+FEFF), do not use the UTF-8 BOM.

Curagh answered 22/11, 2011 at 20:31 Comment(14)
UTF-8 does not have a BOM, because UTF-8 does not have a byte-order at all. It's 8-bit; and 8-bit values are not changed by endian issues. The BOM is for UTF-16 (and UTF-32) or other non-8-bit encodings.Trig
It is true that UTF-8 does not have a byte ordering; however, there is a thing people refer to as the "UTF-8 BOM", which has nothing to do with determining any byte ordering. The thing referred to by this name is the UTF-8 encoding of U+FEFF (ZERO WIDTH NO-BREAK SPACE) prepended to plain text. Originally my answer did include one reference to 'UTF-8 byte order mark', but I'll remove that so people don't get confused and think the BOM has something to do with byte ordering. ;)Curagh
@NicolBolas: The UTF-8 BOM exists and is 0xEF, 0xBB, 0xBF. It doesn't have any use, though, since UTF-8 doesn't have a byte ordering (what with its code unit only being one byte long and all). It's merely an indicator that this is indeed a UTF-8 file.Lesser
@KerrekSB Not a very good one though, as I detail in my answer.Curagh
@KerrekSB Where did I say the UTF-8 BOM was U+00EF U+00BB U+00BF? I refer to an ISO-8859-1 encoded file beginning with this sequence of characters (i.e., LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, INVERTED QUESTION MARK) but never called this sequence of characters the UTF-8 BOM. In fact I bring this sequence up precisely to point out that it may be confused with, but is not, the UTF-8 BOM.Curagh
@bames53: Ah, OK, I misunderstood. I'll delete the comment. Sorry!Lesser
@KerrekSB no worries. I've clarified this part of my answer too.Curagh
@bames53: Cheers. I suppose the BOM is only useful if you have a strictly bounded family consisting exclusively of BOMmed UTF encodings. In that case, you can tell everything you need to know from the BOM. I cannot imagine any situation where that would ever be useful.Lesser
"I can't think how a source file can legally begin with an identifier" - it can, for an included file. Yes, that is extremely unreasonable ;)Halsted
Unfortunately, I can't simply tell my users to "stop using Notepad" (and source from other programmers who do) and the byte sequence is legitimately inserted by other editors anyway. Besides, I purposely didn't specify UTF-8 or my own implementation in the question… I'm asking about the interaction between the C++ Standard and best practices recommended by the Unicode committee, be it UTF-16 or UTF-32 or whatever.Donation
@Donation Well, the answer is that you're not missing anything, but the defect is with the concept of the BOM. It's arbitrary whether zero width no-break space would count as a valid character in an identifier, or whitespace, or whatever. (Although I note that the Unicode standard specifically mentions allowing Zero Width Joiner in identifiers in some circumstances). But whatever C++ had specified you'd still have the problem that Unicode recommends that if a Unicode stream starts with a particular value then that value shouldn't be considered part of the data.Curagh
@Donation Note that the Unicode committee's recommended best practices here are for consuming text. Their recommended practices for emitting text do not involve the BOM. Instead they recommend alternative ways to determine byte ordering, and under no circumstances is any use of a BOM recommended for detecting or declaring the encoding scheme. Use of the BOM was developed by people unwilling or unable to use best practices for text encoding and Unicode.Curagh
@KerrekSB It's actually even less useful than that. Consider the little endian UTF-32 BOM encoding vs. the little endian UTF-16 BOM followed by U+0000: 0xFF 0xFE 0x00 0x00.Curagh
@bames53: Ah, yes, that's infuriatingly silly. Why couldn't they at least have come up with some 4-byte sequence that isn't valid UTF-16!Lesser

That part of the C++ specification (and your question) is tied to the Unicode specification. Consider: in any normal Unicode file, a U+FEFF (or the like) might appear inside the file, so how should those be interpreted?

According to the Unicode standard, a BOM character at the start of a stream or file is not regarded as part of the content and is ignored in presentation.

When they say 'C++ files can be in Unicode format', they are binding the whole C++ specification to the Unicode specification as well. Here Unicode's rules also govern the C++ specification.

Because the Unicode standard already defines this behaviour (skipping a BOM at the beginning), the writers of the C++ standard had good reason not to repeat it in their document. Anyone implementing a Unicode-aware C++ compiler will take the Unicode standard into account as well.

Garb answered 22/11, 2011 at 20:04 Comment(2)
The C++ standard does not say 'C++ files can be in unicode format'. How physical source characters are stored is outside the scope of the C++ standard.Curagh
C++ source is not used for presentation (in this context). Although it's certainly obfuscating to have invisible identifiers, the BOM is but one of several such entities. The C++ standard names these codepoints as possible identifiers, so they cannot be ignored. Hence my question. Also, for what it's worth, official best practice is not to use a UTF-8 BOM, although they acknowledge it's one way to identify the UTF-8 format. This makes it hard to require users to go one way or the other.Donation
