Adding UTF-8 BOM to string/Blob
I

6

94

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?

Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.

Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).

Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?

Yes, I really do need the UTF-8 BOM in this case.

Indict answered 26/7, 2013 at 10:37 Comment(0)
S
184

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx

See discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.

See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
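As a quick check of the point above, here is a minimal sketch that encodes the single character U+FEFF as UTF-8 and prints the resulting bytes. It assumes an environment where TextEncoder is a global (current browsers and recent Node.js versions); the variable names are only illustrative.

```javascript
// Encode the one-character string "\uFEFF" as UTF-8.
// TextEncoder always produces UTF-8 and does not strip a leading BOM.
const bomBytes = new TextEncoder().encode('\uFEFF');

// Format each byte as two uppercase hex digits for display.
const bomHex = Array.from(bomBytes, (b) =>
  b.toString(16).toUpperCase().padStart(2, '0')
).join(' ');

console.log(bomHex); // EF BB BF
```

The string holds one character; the three bytes only appear once that character is serialized as UTF-8.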

Schick answered 26/7, 2013 at 10:51 Comment(6)
A warning for anyone else reading this: watch out, as \ufeff is actually the UTF-16 BOM and not the UTF-8 BOM en.wikipedia.org/wiki/Byte_order_mark Calen
Great piece of code for the BOM encoding, and it works great! @Calen You are right... I want to make a TSV file with tab separators, and the tab character for UTF-8 is \t. The same character is not working with UTF-16 BE (BOM), and I cannot find the corresponding character. Do you know where to find it, or what the character for \t is? Thank you!Sizeable
@mEnE since \t (codepoint U+0009) is < 127, \t is 0x09 in UTF-8, just as it is in UTF-16 (0x0009). The only difference is the order the bytes are stored physically. In UTF-8 0x09. In UTF-16 LE 0x09, 0x00. In UTF-16 BE 0x00, 0x09.Calen
Just a small clarification: the character \uFEFF is the BOM character for all UTFs (8, 16 LE and 16 BE). However, it is encoded as the bytes 0xEF 0xBB 0xBF in UTF-8, 0xFF 0xFE in UTF-16 LE, and 0xFE 0xFF in UTF-16 BE. It's important to distinguish the internal Unicode character (\ufeff) from the various ways of representing that one character in bytes. :)Elaboration
Holy crap, this worked!! I used it in an HTML doc I was sending to my Kindle. THANK YOU Erik!Officious
thanks a lot. i've been searching for this a while!!Crossed
C
54

I had the same issue and this is the solution I came up with:

var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM as raw bytes
    "Text",
    // ... remaining data
], { type: "text/plain;charset=utf-8" });

Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).

You should replace text/plain with your desired MIME type.
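A small sketch of the same idea wrapped in a helper, compared with the string-based approach. The names BOM_BYTES and makeBomBlob are hypothetical, and the snippet assumes a global Blob (browsers, or a recent Node.js version).

```javascript
// UTF-8 BOM as raw bytes, so no string-to-bytes conversion is involved.
const BOM_BYTES = new Uint8Array([0xEF, 0xBB, 0xBF]);

// Hypothetical helper: build a Blob that starts with the UTF-8 BOM.
function makeBomBlob(text, mimeType = 'text/plain') {
  return new Blob([BOM_BYTES, text], { type: `${mimeType};charset=utf-8` });
}

const byteBlob = makeBomBlob('Text');
// String parts of a Blob are serialized as UTF-8, so "\uFEFF" also becomes 3 bytes.
const stringBlob = new Blob(['\uFEFF' + 'Text']);

console.log(byteBlob.size, stringBlob.size); // 7 7
```

Both blobs are 7 bytes long here: 3 bytes of BOM plus 4 bytes of ASCII text.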

Calen answered 28/12, 2016 at 13:25 Comment(1)
This is the correct way to do it when using Blob or working with actual bytes instead of JS strings. Erik and Jeff's answers are correct when you're using JS strings and not actual bytes.Rifkin
P
26

I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.

The short answer is, yes, this code works.

The long answer is, no, FEFF is not the byte order mark for utf-8. Apparently node took some sort of shortcut for writing encodings within files. FEFF is the UTF16 Little Endian encoding as can be seen within the Byte Order Mark wikipedia article and can also be viewed within a binary text editor after having written the file. I've verified this is the case.

http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

Apparently, Node.js uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third (options) parameter of writeFile, where you pass the encoding string. Node.js takes this encoding string and converts the \ufeff marker into the byte order mark of the actual encoding.

UTF-8 Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
   /* The actual byte order mark written to the file is EF BB BF */
});

UTF-16 Little Endian Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
   /* The actual byte order mark written to the file is FF FE */
});

So, as you can see, \ufeff is simply a marker that can stand for any number of resulting encodings. The bytes that actually make it into the file depend directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.

I suspect that the reasoning behind this is because they chose not to write byte order marks and the 3 byte mark for UTF-8 isn't easily encoded into the javascript string to be written to disk. So, they used the UTF16LE BOM as a placeholder mark within the string which gets substituted at write-time.
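One way to see which bytes each encoding option produces, without writing a file, is to encode the same string with Buffer.from. This is a minimal sketch that assumes Node.js (Buffer is a Node global); the sample text 'hi' and the variable names are made up for illustration.

```javascript
// The same string, starting with the single character U+FEFF.
const text = '\uFEFF' + 'hi';

// Encode it two ways and show the bytes as lowercase hex.
const utf8Hex = Buffer.from(text, 'utf8').toString('hex');
const utf16Hex = Buffer.from(text, 'utf16le').toString('hex');

console.log(utf8Hex);  // efbbbf6869
console.log(utf16Hex); // fffe68006900
```

In both cases the leading bytes are just U+FEFF encoded like every other character in the string, which matches the byte sequences quoted in the two writeFile examples.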

Polynomial answered 16/1, 2015 at 0:46 Comment(9)
Well, if you look at the byte order mark and what I originally said, it's right. The FEFF byte order mark is not the byte order mark for UTF-8 as you stated in your question. The original answer seems to have stumbled onto the right answer or at least didn't elaborate at all. The only reason they got it right is because the options encoding defaults to utf-8. Not because the byte order mark they supplied is actually a UTF-8 byte order mark.Polynomial
lol, well, someone else will want to actually know how it works. Since the original answer doesn't describe why a UTF16LE BOM magically works. Someone in the future will want to actually understand what the heck is happening.Polynomial
Feel free to remove your mark down of my answer. It's not wrong.Polynomial
I'm a bit confused by this since the question doesn't mention node at all.Alexandriaalexandrian
Yes, I assumed that the original question was not a browser question. I assumed that they were experiencing the exact same issue that I was experiencing, within node.Polynomial
It's not really specific to Node at all; I think you're a bit confused about the byte order mark.Alexandriaalexandrian
Specifically, you can see here that the BOM is always the same character (U+FEFF), and not a different character depending on what type of Unicode or endianness the text is in. It's true that the bytes written are different but that's because the same character is being written with different encodings.Alexandriaalexandrian
Added some more details to the accepted answer to elaborate on why this works. Feel free to edit as you see fit.Louise
Might not be specific to Node. #6002756 A few people tried it in .NET and Java, and it worked too.Gales
U
11

This is my solution:

var blob = new Blob(["\uFEFF"+csv], {
type: 'text/csv; charset=utf-18'
});
Underwood answered 20/11, 2018 at 21:4 Comment(2)
Can you explain why this works, please? And is utf-18 even a valid encoding?Helbonia
I think charset=utf-18 is a typo.Luganda
M
2

This works for me:

let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });

A BOM (byte order mark) may be necessary because some programs rely on it to pick the correct character encoding.

Example: on a system whose default character encoding is Shift_JIS rather than UTF-8, MS Excel opens a CSV file without a BOM in that default encoding, which results in garbage characters. Prefixing the file with the UTF-8 BOM fixes this.
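If the CSV text may pass through this step more than once, it can help to avoid adding the BOM twice. The following is only a sketch under that assumption; ensureBom is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: prefix U+FEFF only when the text does not already start with it.
function ensureBom(text) {
  return text.startsWith('\uFEFF') ? text : '\uFEFF' + text;
}

const once = ensureBom('a,b\n1,2\n');
const twice = ensureBom(once); // unchanged: the BOM is already present

console.log(once === twice); // true
```

The result can then be passed to the Blob constructor as in the line above, for example new Blob([ensureBom(csv)], { type: 'text/csv;charset=utf-8' }).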

Mckenna answered 18/10, 2022 at 8:7 Comment(0)
D
-1

This fixes it for me. I was getting a BOM with the Authorize.Net API and Cloudflare Workers:

const data = JSON.parse((await res.text()).trim());

Debenture answered 9/2, 2023 at 12:7 Comment(1)
The question asks how to add a BOM, not how to strip it.Indict
