What is meant by BOM? I tried reading this article but haven't really understood what it means.
I read that some text editors put a BOM at the beginning of a file. What is it meant for?
BOM stands for Byte Order Mark. In short, the BOM is a marker at the beginning of a file that indicates whether the most significant byte or the least significant byte comes first.
It causes a lot of problems, especially with UTF-8. UTF-8 does not need a BOM, but there is a variant sometimes called UTF8Y (or UTF-8 with BOM) that puts three extra bytes (EF BB BF) at the beginning of a file.
Sending a UTF8Y file with a UTF-8 encoding type causes those extra bytes to be sent at the beginning of the file, and can cause all sorts of hard-to-track-down problems, including the DOCTYPE not being parsed correctly in IE or JSON files failing to decode.
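As a rough illustration of the JSON failure mentioned above, here is a minimal Python sketch. The `payload` bytes and both function names are made up for the example; the point is only that decoding as plain `utf-8` leaves the BOM in the text, while Python's `utf-8-sig` codec drops one leading BOM if it is there.

```python
import codecs
import json

# Hypothetical payload: a JSON document saved as "UTF-8 with BOM".
payload = codecs.BOM_UTF8 + b'{"ok": true}'


def parse_as_plain_utf8(data: bytes):
    """Decode as plain UTF-8; a leading BOM survives as U+FEFF in the text."""
    return json.loads(data.decode("utf-8"))


def parse_bom_tolerant(data: bytes):
    """Decode with utf-8-sig, which drops one leading BOM if present."""
    return json.loads(data.decode("utf-8-sig"))


if __name__ == "__main__":
    try:
        parse_as_plain_utf8(payload)
    except json.JSONDecodeError as exc:
        print("plain utf-8 failed:", exc)
    print(parse_bom_tolerant(payload))
```

`utf-8-sig` also accepts input without a BOM, so it is a reasonable default when you do not control where the files come from.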
It has bitten me a few times with files from other people, when I didn't check the filetype carefully.
My recommendation: Be mindful it exists, never purposefully use it.
jEdit or Notepad++). According to this chart, UTF-8 does not have a BOM. Er, further down it does indicate that a BOM is optional, but has no effect on the actual byte order. Sounds like UTF8Y is an official name to separate it from UTF-8 without BOM. – Verdaverdant
UTF-8 without BOM and still some extra characters are included in the file! What is meant by it? – Reforest
A byte order mark allows a program to determine how to read Unicode data. From your Wiki page:
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in.
For UTF-8, there is no ambiguity over how to read the bytes and hence a BOM is often omitted. For UTF-16 and UTF-32 it is necessary to know how to interpret the bytes and a BOM can serve this purpose.
Note that Java has problems with reading UTF-8 BOMs and you must manually handle these characters if present (see Reading UTF-8 - BOM marker for some links to the related Sun bugs).
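To make the "BOM can serve this purpose" point concrete, here is a small sketch of BOM sniffing, written in Python rather than Java for brevity. The names `sniff_bom` and `decode_with_bom` are invented for this example, and the fallback to UTF-8 when no BOM is present is an assumption, not something a BOM-less file guarantees.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM when present, else fall back to `default`."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

One known limitation of this kind of check: UTF-16 LE text that starts with a BOM followed by a NUL character looks identical to a UTF-32 LE BOM, so the result is a best guess rather than a proof.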
I'm probably going to cover stuff you already know, but here goes...
To understand the purpose of a BOM, you need to understand (at least conceptually) what endian-ness is all about.
If you're dealing with a single byte (8 binary bits), its bits are ordered by increasing significance from right to left (just like the digits of a normal decimal number, like "19"). That's simple enough as long as the number fits in a single byte. Once you get to two bytes, you need to know which of the two bytes is more significant, and that is what big endian versus little endian describes. Big endian means that the lowest memory address (or the left-most, to continue the analogy to writing) holds the most significant byte, which continues the convention of Western decimal numbers; little endian stores the least significant byte first. Historically, Intel has been little endian, and Motorola has been big endian. (I haven't looked lately, that may be different now.)
The BOM is simply a marker saying which way to interpret the byte order of the data.
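The byte-order difference described above can be seen in a few lines of Python. This is only an illustrative sketch; the value `0x1234` and the letter "A" are arbitrary choices.

```python
value = 0x1234  # two bytes: 0x12 is the most significant, 0x34 the least

big = value.to_bytes(2, "big")        # most significant byte first
little = value.to_bytes(2, "little")  # least significant byte first
print(big.hex(), little.hex())

# The same applies to 16-bit text: "A" is U+0041.
a_big = "A".encode("utf-16-be")
a_little = "A".encode("utf-16-le")
print(a_big.hex(), a_little.hex())

# Reading big-endian bytes as little-endian yields a different character.
misread = a_big.decode("utf-16-le")
print(hex(ord(misread)))
```

The last step is the situation a BOM is meant to prevent: the bytes are valid either way, so nothing fails, you just get the wrong text.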
Today, this is simply meant to say, "This file is in UTF-8". Or, "This file is in UTF-16". While it is still the same BOM character in both cases, the way the BOM is encoded implies how all the rest will be encoded.
If you do not know what the first character is, you cannot deduce the document encoding from it reliably - you have to determine it from somewhere else, or more or less guess it.
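The idea that one and the same BOM character identifies the encoding by how it is encoded can be checked directly. A short sketch, assuming nothing beyond Python's standard codecs (the variable names are just for illustration):

```python
BOM = "\ufeff"  # the single Unicode character used as the byte order mark

ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# Encode the same character under each encoding and keep the raw bytes.
encoded = {name: BOM.encode(name) for name in ENCODINGS}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex(' ')}")
```

Each encoding produces a different byte sequence for the same character, which is why a reader that sees those bytes at the start of a file can infer how the rest was written.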
Post-downvote appendix:
Historically, the BOM character had a different purpose - a zero width no-break space (that is, as invisible as a Unicode character can be, but still a character). Lots of widely used software libraries such as .NET and Java add the BOM automatically or implicitly to written files or even byte arrays, which often tricks people into thinking that they are not using a BOM when they are. This often backfires when a stack of such libraries writes multiple BOMs at the beginning of the same file, because then your file begins with an illegal or unwanted character, the zero width no-break space, and you do not even see it when you inspect the file!
No wonder the BOM technique is not popular with everyone.
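The multiple-BOM problem can be reproduced in a small Python sketch. The doubled input is a constructed example, and `strip_leading_boms` is a hypothetical helper, not part of any library.

```python
import codecs

# Hypothetical scenario: two layers each prepend a UTF-8 BOM.
doubled = codecs.BOM_UTF8 + codecs.BOM_UTF8 + "hello".encode("utf-8")

# utf-8-sig removes only the first BOM; the second one is decoded as
# an ordinary (invisible) U+FEFF character at the start of the text.
text = doubled.decode("utf-8-sig")
print(len(text), repr(text))


def strip_leading_boms(s: str) -> str:
    """Remove every U+FEFF at the start of an already-decoded string."""
    return s.lstrip("\ufeff")


print(repr(strip_leading_boms(text)))
```

Printing with `repr` is what makes the stray character visible; printed normally, the two strings look the same.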
UTF-16LE or UTF-16BE ;) – Bearberry
java tag. – Outlet