Adding UTF-8 BOM to string/Blob

Asked 26/7, 2013 at 10:37 Answered 9/2, 2023 at 12:7

Solved javascript utf-8 blob fileapi byte-order-mark

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?

Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.

Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).

Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?

(Yes, I really do need the UTF-8 BOM in this case.)

Indict answered 26/7, 2013 at 10:37 Comment(0)

184

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx

See discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.

See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
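A minimal sketch of why prepending \ufeff works: the character is a single code point, and encoding it as UTF-8 produces the three signature bytes. It assumes an environment where TextEncoder is a global (current browsers, recent Node.js); the variable names are only illustrative.

```javascript
// U+FEFF is one character; its byte form depends on the encoding used to serialize it.
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();
const bomBytes = Array.from(encoder.encode("\uFEFF"));

// Format each byte as two uppercase hex digits for display.
const bomHex = bomBytes
  .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
  .join(" ");

console.log(bomHex); // EF BB BF
```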

Schick answered 26/7, 2013 at 10:51 Comment(6)

A warning for anyone else reading this: watch out, as \ufeff is actually the UTF-16 BOM and not the UTF-8 BOM en.wikipedia.org/wiki/Byte_order_mark – Calen 28/12, 2016 at 15:32

Great piece of code for the BOM encoding and works great! @Calen You are right ... I want to make a TSV file with tab separators, and the tab character for UTF-8 is \t. The same char is not working with UTF-16 BE (BOM) and I cannot find the corresponding char ... Do you know where to find it, or what char \t is? Thank you ... ! – Sizeable 19/7, 2017 at 8:25

@mEnE since \t (codepoint U+0009) is < 127, \t is 0x09 in UTF-8, just as it is in UTF-16 (0x0009). The only difference is the order the bytes are stored physically. In UTF-8 0x09. In UTF-16 LE 0x09, 0x00. In UTF-16 BE 0x00, 0x09. – Calen 22/7, 2017 at 13:46

Just a small clarification: The character \uFEFF is the BOM character for all UTFs (8, 16 LE and 16 BE). However, it is encoded as bytes: - 0xEF 0xBB 0xBF - 0xFF 0xFE - 0xFE 0xFF respectively. It's important to distinguish the internal unicode character (\ufeff), and the various ways representing that one character, in bytes. :) – Elaboration 30/12, 2017 at 7:14

Holy crap, this worked!! I used it in an HTML doc I was sending to my Kindle. THANK YOU Erik! – Officious 1/4, 2018 at 2:4

thanks a lot. i've been searching for this a while!! – Crossed 4/5, 2021 at 19:30

I had the same issue and this is the solution I came up with:

var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
], { type: "text/plain;charset=utf-8" });

Using Uint8Array prevents the browser from converting those bytes into string (tested on Chrome and Firefox).

You should replace text/plain with your desired MIME type.
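As a rough check, the byte sizes of the resulting Blobs show why the attempt in the question fails while the other two approaches agree. This is only a sketch: it assumes Blob is available as a global (browsers, or Node.js 18 and later), and the variable names are made up for illustration.

```javascript
const content = "Text"; // 4 bytes in UTF-8

// Raw BOM bytes followed by the text: 3 + 4 bytes.
const viaBytes = new Blob([new Uint8Array([0xef, 0xbb, 0xbf]), content], {
  type: "text/plain;charset=utf-8",
});

// The single character U+FEFF; Blob encodes string parts as UTF-8: 3 + 4 bytes.
const viaString = new Blob(["\uFEFF" + content], {
  type: "text/plain;charset=utf-8",
});

// The attempt from the question: "\xEF\xBB\xBF" is three separate characters
// (U+00EF, U+00BB, U+00BF), and each becomes 2 bytes in UTF-8: 6 + 4 bytes.
const mistaken = new Blob(["\xEF\xBB\xBF" + content]);

console.log(viaBytes.size, viaString.size, mistaken.size); // 7 7 10
```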

Calen answered 28/12, 2016 at 13:25 Comment(1)

This is the correct way to do it when using Blob or working with actual bytes instead of JS strings. Erik and Jeff's answers are correct when you're using JS strings and not actual bytes. – Rifkin 26/3, 2019 at 16:10

I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.

The short answer is, yes, this code works.

The long answer is, no, FEFF is not the byte order mark for utf-8. Apparently node took some sort of shortcut for writing encodings within files. FEFF is the UTF16 Little Endian encoding as can be seen within the Byte Order Mark wikipedia article and can also be viewed within a binary text editor after having written the file. I've verified this is the case.

http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

Apparently, Node.JS uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third (options) parameter of writeFile, which is where you pass the encoding string. Node.JS takes this encoding string and converts the \ufeff fixed byte encoding into any one of the actual encoding's byte order marks.

UTF-8 Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
   /* The actual byte order mark written to the file is EF BB BF */
});

UTF-16 Little Endian Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
   /* The actual byte order mark written to the file is FF FE */
});

So, as you can see, \ufeff is simply a marker standing for any number of resulting encodings. The actual encoding that makes it into the file is directly dependent on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.

I suspect that the reasoning behind this is because they chose not to write byte order marks and the 3 byte mark for UTF-8 isn't easily encoded into the javascript string to be written to disk. So, they used the UTF16LE BOM as a placeholder mark within the string which gets substituted at write-time.
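The bytes claimed in the two examples can be checked without writing a file. The sketch below assumes Node.js (Buffer is a Node global) and simply encodes the one character under each encoding. Note that this is ordinary string encoding of U+FEFF, which is the reading several commenters give, rather than evidence of a special substitution step.

```javascript
// Encode the single character U+FEFF under two different encodings.
const asUtf8 = Buffer.from("\uFEFF", "utf8").toString("hex");
const asUtf16le = Buffer.from("\uFEFF", "utf16le").toString("hex");

console.log(asUtf8);    // efbbbf
console.log(asUtf16le); // fffe
```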

Polynomial answered 16/1, 2015 at 0:46 Comment(9)

Well, if you look at the byte order mark and what I originally said, it's right. The FEFF byte order mark is not the byte order mark for UTF-8 as you stated in your question. The original answer seems to have stumbled onto the right answer or at least didn't elaborate at all. The only reason they got it right is because the options encoding defaults to utf-8. Not because the byte order mark they supplied is actually a UTF-8 byte order mark. – Polynomial 16/1, 2015 at 19:8

lol, well, someone else will want to actually know how it works. Since the original answer doesn't describe why a UTF16LE BOM magically works. Someone in the future will want to actually understand what the heck is happening. – Polynomial 19/1, 2015 at 21:21

Feel free to remove your mark down of my answer. It's not wrong. – Polynomial 19/1, 2015 at 21:24

I'm a bit confused by this since the question doesn't mention node at all. – Alexandriaalexandrian 5/3, 2015 at 15:35

Yes, I assumed that the original question was not a browser question. I assumed that they were experiencing the exact same issue that I was experiencing, within node. – Polynomial 5/3, 2015 at 20:23

It's not really specific to Node at all; I think you're a bit confused about the byte order mark. – Alexandriaalexandrian 9/3, 2015 at 20:21

Specifically, you can see here that the BOM is always the same character (U+FEFF), and not a different character depending on what type of Unicode or endianness the text is in. It's true that the bytes written are different but that's because the same character is being written with different encodings. – Alexandriaalexandrian 9/3, 2015 at 20:27

Added some more details to the accepted answer to elaborate on why this works. Feel free to edit as you see fit. – Louise 3/1, 2017 at 10:23

Might not be specific to Node. #6002756 A few people tried it in .NET and Java, and it worked too. – Gales 10/6, 2019 at 1:22

This is my solution:

var blob = new Blob(["\uFEFF"+csv], {
type: 'text/csv; charset=utf-18'
});

Underwood answered 20/11, 2018 at 21:4 Comment(2)

Can you explain why this works please, and is utf-18 even a valid encoding – Helbonia 16/7, 2021 at 8:6

I think charset=utf-18 is typo. – Luganda 31/8, 2023 at 10:25

This works for me:

let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });

A BOM (byte order mark) may be necessary because some programs rely on it to pick the correct character encoding.

Example: When opening a csv file without a BOM in a system with a default character encoding of Shift_JIS instead of UTF-8 in MS Excel, it will open it in default encoding. This will result in garbage characters. If you specify the BOM for UTF-8, it will fix it.
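If the CSV text might already start with a BOM, a small guard avoids writing it twice. This is a sketch only: withUtf8Bom is a hypothetical helper name, not something from the answer, and the sample CSV is invented for illustration.

```javascript
// Hypothetical helper: prepend U+FEFF only when the text does not already start with it.
function withUtf8Bom(text) {
  return text.startsWith("\uFEFF") ? text : "\uFEFF" + text;
}

const csv = "name,city\n山田,東京\n"; // sample data, made up for this sketch
const once = withUtf8Bom(csv);
const twice = withUtf8Bom(once); // applying it again changes nothing

console.log(once === twice); // true
```

The result can then be passed to new Blob([...]) with type 'text/csv;charset=utf-8' as in the line above.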

Mckenna answered 18/10, 2022 at 8:7 Comment(0)

-1

This fixes it for me. I was getting a BOM with the authorize.net API and Cloudflare Workers:

const data = JSON.parse((await res.text()).trim());

Debenture answered 9/2, 2023 at 12:7 Comment(1)

The question asks how to add a BOM, not how to strip it. – Indict 9/2, 2023 at 23:15
