what is meant by BOM? [closed]
R

4

6

What is meant by BOM? I tried reading this article but didn't really understand what it means.

I read that some text editors put a BOM at the beginning of a file. What is it meant for?

Reforest answered 12/10, 2012 at 13:27 Comment(3)
It is meant to tell the reader which encoding was used so it can be decoded.Bearberry
I'm assuming the Java tag was added for a reason, even if the OP didn't explicitly reference it. Java has some peculiarities when it comes to handling Unicode characters and so it may flavour the responses.Spiry
Why is this closed? It's a good question. Also, the OP might want answers that include Java code, APIs, etc.; that's why the OP added the java tag.Outlet
V
17

BOM stands for Byte Order Mark. In short, the BOM is a marker at the beginning of a file that indicates whether the most significant byte or the least significant byte comes first.

It causes a lot of problems, especially with UTF8. UTF8 does not use a BOM, but there is a variant called UTF8Y (Or UTF with BOM) that includes a few extra characters at the beginning of a file.

Sending a UTF8Y file with a UTF8 encoding type causes a few extra bytes to be sent at the beginning of the file, which can cause all sorts of hard-to-track-down problems, including the DOCTYPE not being parsed correctly in IE or JSON files failing to decode.

It has bitten me a few times with files from other people, when I didn't check the filetype carefully.

My recommendation: Be mindful it exists, never purposefully use it.

Verdaverdant answered 12/10, 2012 at 13:33 Comment(5)
+1 for "Be mindful it exists, never purposefully use it."Historiated
I completely agree about the uselessness of BOMs in UTF-8, but can you cite a reference for where UTF8Y is defined or where UTF8 "does not use a BOM"? The Unicode standard permits BOMs in UTF8 (but indicates they're pointless) and I can't find a reference to UTF8Y in the spec either.Spiry
Having said that, Google presents many results for UTF8Y. So perhaps it is a common deviation from the pure spec?Spiry
@DuncanJones --I only know the term UTF8Y from one of my editors at work (either jEdit or Notepad++). According to this chart, UTF8 does not have a BOM. ER, further down it does indicate that a BOM is optional, but has no effect on the actual byte order. Sounds like UTF8Y is an official name to separate it from UTF8 without BOM.Verdaverdant
"UTF8 does not use a BOM, but there is a variant called UTF8Y (Or UTF with BOM) that includes a few extra characters at the beginning of a file." The name is UTF-8 without BOM, yet some extra characters are still included in the file! What does that mean?Reforest
S
5

A byte order mark allows a program to determine how to read Unicode data. From your Wiki page:

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in.

For UTF-8, there is no ambiguity over how to read the bytes and hence a BOM is often omitted. For UTF-16 and UTF-32 it is necessary to know how to interpret the bytes and a BOM can serve this purpose.
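As a rough illustration of how a reader can use the BOM, here is a minimal sketch that inspects the first bytes of some data and guesses the encoding. The class name `BomSniffer` and the method `detect` are hypothetical names made up for this example. The byte patterns are the standard encodings of U+FEFF. Note that a match is only a hint: `FF FE 00 00` could in principle also be UTF-16LE text that starts with a NUL character.

```java
import java.util.Optional;

/** Hypothetical helper: guesses an encoding name from a leading byte order mark. */
final class BomSniffer {

    private BomSniffer() {}

    /** Returns the encoding name implied by a BOM at the start of data, if one is present. */
    static Optional<String> detect(byte[] data) {
        // Check the 4-byte UTF-32 marks first, because the UTF-32LE mark
        // (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: the encoding must come from somewhere else
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0x00};
        System.out.println(detect(utf16le).orElse("no BOM found")); // UTF-16LE
    }
}
```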

Note that Java has problems with reading UTF-8 BOMs and you must manually handle these characters if present (see Reading UTF-8 - BOM marker for some links to the related Sun bugs).
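One way to handle this manually is to decode as usual and then drop a leading U+FEFF yourself. This is only a sketch: `Utf8BomStripper` and `decodeUtf8` are hypothetical names, and it assumes the whole input fits in memory as a byte array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes UTF-8 bytes and removes a leading BOM if present. */
final class Utf8BomStripper {

    private static final char BOM = '\uFEFF';

    private Utf8BomStripper() {}

    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        // Java's UTF-8 decoder keeps the BOM as an ordinary character,
        // so it has to be dropped by hand.
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(new String(withBom, StandardCharsets.UTF_8).length()); // 3
        System.out.println(decodeUtf8(withBom).length());                         // 2
    }
}
```

For streams, the same check can be applied to the first character read from the `Reader`.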

Spiry answered 12/10, 2012 at 13:33 Comment(2)
+1 for the heads up on Java problems.Verdaverdant
Yeah, that was a wasted afternoon finding that one :-) "WTF is this question mark still doing here?!"Spiry
T
2

I'm probably going to cover stuff you already know, but here goes...

To understand the purpose of a BOM, you need to understand (at least conceptually) what endian-ness is all about.

If you're dealing with a single byte (8 binary bits), its bits increase in significance from right to left (just like the digits of a normal decimal number, like "19"). That's simple enough as long as the number fits in a single byte. Once you get to two bytes, you need to know which of the two bytes is more significant; the two possible orderings are called big endian and little endian. Big endian means that the lowest memory address (or the left-most, to continue the analogy to writing) holds the most significant byte, continuing the convention of Western decimal numbers. Little endian is the reverse: the least significant byte comes first. Historically, Intel has been little endian, and Motorola has been big endian. (I haven't looked lately, that may be different now.)

The BOM is simply a marker saying which way to interpret the byte order of the data.
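To make the byte-order idea concrete, here is a small sketch that writes the same 16-bit number in both orders using `java.nio.ByteBuffer`. The class `EndianDemo` and its methods are hypothetical names for this example, and 0x1234 is just an arbitrary value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical demo: the same 16-bit value serialised in both byte orders. */
final class EndianDemo {

    private EndianDemo() {}

    /** Writes a 16-bit value into two bytes using the given byte order. */
    static byte[] toBytes(short value, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort(value).array();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        short value = 0x1234;
        System.out.println(hex(toBytes(value, ByteOrder.BIG_ENDIAN)));    // 12 34
        System.out.println(hex(toBytes(value, ByteOrder.LITTLE_ENDIAN))); // 34 12
    }
}
```

A reader who sees the two bytes `34 12` cannot tell whether they mean 0x3412 or 0x1234 without knowing the order, which is the gap the BOM fills for UTF-16 and UTF-32.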

Teamwork answered 12/10, 2012 at 13:44 Comment(0)
A
-1

Today, the BOM is simply meant to say, "This file is in UTF-8". Or, "This file is in UTF-16". While it is still the same BOM character (U+FEFF) in both cases, the way the BOM is encoded implies how all the rest will be encoded.
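As a quick sketch of that point, the following encodes the single character U+FEFF with three of Java's standard charsets and prints the resulting bytes. `BomBytes` and `hex` are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: one BOM character, three different byte sequences. */
final class BomBytes {

    static final String BOM = "\uFEFF";

    private BomBytes() {}

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(BOM.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(BOM.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(BOM.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```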

If you do not know what the first character is, you cannot deduce the document encoding from it reliably - you have to determine it from somewhere else, or more or less guess it.

Post-downvote appendix:

Historically, the BOM character had a different purpose: it was a zero width no-break space (that is, as invisible as a Unicode character can be, but still a character). Lots of widely used software libraries such as .NET and Java add the BOM automatically or implicitly to written files or even byte arrays, which often tricks people into thinking that they are not using a BOM when they are. This often backfires when a stack of such libraries writes multiple BOMs at the beginning of the same file, because then your file begins with an illegal or unwanted character, the zero width no-break space, and you do not even see it when you inspect the file!

No wonder the BOM is not popular with everyone.
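One concrete case of an implicit BOM in Java, shown as a small sketch: encoding with the generic `UTF_16` charset prepends a big-endian BOM, while `UTF_16BE` and `UTF_8` do not. `ImplicitBomDemo` and `hex` are hypothetical names for this example; other libraries and platforms may behave differently.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: which of Java's standard charsets write a BOM when encoding. */
final class ImplicitBomDemo {

    private ImplicitBomDemo() {}

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The generic UTF-16 charset writes a big-endian BOM before the text.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // The explicit-endianness and UTF-8 charsets write no BOM.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));    // 41
    }
}
```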

Act answered 12/10, 2012 at 13:33 Comment(2)
+1 Technically it will say UTF-16LE or UTF-16BE ;)Bearberry
@PeterLawrey - Thank you, and yes. I am simplifying the topic intentionally.Act
