UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)

First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.

So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!

Subalternate answered 16/3, 2010 at 16:53 Comment(0)

Since you state:

All of my (text) files are currently stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

It automatically removes the BOM when one is present, and decodes correctly when one is not.
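A quick sanity check (written in Python 3 syntax, where bytes literals take a `b` prefix; under 2.x they are plain `str`) confirms both halves of that claim:

```python
# 'utf-8-sig' strips a single leading BOM when present and leaves
# BOM-less input untouched.
with_bom = b'\xef\xbb\xbfHello, world!'
without_bom = b'Hello, world!'

assert with_bom.decode('utf-8-sig') == 'Hello, world!'
assert without_bom.decode('utf-8-sig') == 'Hello, world!'

# Encoding with 'utf-8-sig' prepends the BOM:
assert 'Hello, world!'.encode('utf-8-sig') == with_bom
```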

Weightless answered 17/3, 2010 at 3:47 Comment(2)
Ooh! Very nice! I'll try this as soon as I can. — Subalternate
Works beautifully (although Chrome mysteriously stopped giving the error no matter what, even with my old (wrong) code -- that's what I get for doing a whole bunch of changes at once). — Subalternate

Check the first character after decoding to see if it's the BOM:

if u.startswith(u'\ufeff'):
    u = u[1:]
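Wrapped into a small helper (shown in Python 3 syntax, where every `str` is unicode; under 2.x the literals would need a `u` prefix), with the check applied only after decoding:

```python
def strip_bom(text):
    """Remove a single leading BOM (U+FEFF) from an already-decoded string."""
    if text.startswith('\ufeff'):
        return text[1:]
    return text

assert strip_bom('\ufeffhello') == 'hello'
assert strip_bom('hello') == 'hello'   # no-op when no BOM is present
```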
Magnien answered 16/3, 2010 at 17:33 Comment(6)
Will u'\ufffe' ever occur at the beginning of a non-UTF-8 file? Wouldn't the BOM take two "characters" (read: bytes) in my case (UTF-8)? — Subalternate
u'\ufffe' may be found at the beginning of any UTF- or UCS-encoded file. The BOM is three bytes in UTF-8, but it's still a single Unicode codepoint. — Magnien
OK, so just to get this straight, I'd need to first decode the byte-content of the file using u = contents.decode('utf-8') and then I'd be able to use your method because the BOM is a single codepoint. Correct? — Subalternate
@John: Calling getting the numbers mixed around "utterly wrong" is just slightly melodramatic, don't you think? — Magnien
@Ignacio: I still think this answer is the best for my circumstances, however I suggest you edit your answer to use u'\ufeff' instead. It seems to be the correct order (when using the Unicode codepoint -- the order of the actual encoded bytes depends, which is the whole point of the BOM). — Subalternate
@Ignacio: The effect of "getting the numbers mixed around" was to produce not the BOM but the AntiBOM -- utterly wrong, just like confusing Christ and the Antichrist. Before mucking about with ordnance, it's a good idea to read the instructions carefully c.f. the Holy Hand Grenade of Antioch. — Dachau

The previously-accepted answer is WRONG.

u'\ufffe' is not a character. If you get it in a unicode string, somebody has stuffed up mightily.

The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff'

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

Here's a correct and typo/braino-resistant answer:

Decode your input into unicode_str. Then do this:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]

Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.

Update: Unfortunately there seems to be no suitable named constant in the standard Python library.

Alas, the codecs module provides only "a snare and a delusion":

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>
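The one constant in that table that is directly usable here is codecs.BOM_UTF8, the raw three-byte UTF-8 signature; since it is a byte string, it can be stripped from still-undecoded input (bytes literals in Python 3 syntax):

```python
import codecs

# Build a BOM-prefixed byte string, then strip the UTF-8 BOM
# from the raw bytes before decoding.
raw = codecs.BOM_UTF8 + 'Hello'.encode('utf-8')

if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]

assert raw.decode('utf-8') == 'Hello'
```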

Update 2: If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there was only one way each, then you wouldn't need a BOM, would you?

Here verbatim unprettified from my own code is my solution to this:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).

If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.
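For reference, here is the same table in Python 3 bytes literals, plus a hypothetical decode_with_bom wrapper (the UTF-8 fallback default is my assumption, not part of the answer above):

```python
def check_for_bom(s):
    # Longest signatures first, so UTF-32LE input is not
    # misidentified as UTF-16LE (both start with FF FE).
    bom_info = (
        (b'\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        (b'\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        (b'\xEF\xBB\xBF',     3, 'UTF-8'),
        (b'\xFF\xFE',         2, 'UTF-16LE'),
        (b'\xFE\xFF',         2, 'UTF-16BE'),
    )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

def decode_with_bom(raw, default='UTF-8'):
    # Assumed helper: decode the post-BOM part with the detected
    # encoding, falling back to `default` when no BOM is found.
    enc, bom_len = check_for_bom(raw)
    return raw[bom_len:].decode(enc or default)

assert decode_with_bom('hi'.encode('utf-8-sig')) == 'hi'
assert decode_with_bom(b'\xff\xfe' + 'hi'.encode('utf-16-le')) == 'hi'
assert decode_with_bom(b'plain ascii') == 'plain ascii'
```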

Dachau answered 16/3, 2010 at 22:50 Comment(5)
I fail to see how "ZERO WIDTH NO-BREAK SPACE", used here because it is also the BOM (pun intended), is any more legible than u"\uFEFF". They both require prior knowledge about the BOM to be understood. — Subalternate
@Cameron: The legibility comes from giving whatever constant you use a name e.g. UNICODE_BOM. — Dachau
@Cameron: I know nothing about the BOM, but I have a sense of what a "zero width no-break space" is, and no idea what a u"\uFEFF" is. The latter is also harder to be sure that I've typed correctly, since its 8-character length consists of only 3 alphanumeric characters, two of which closely resemble each other. — Carolus
@Vickie: In this context, the "zero width no-break space" is not being used to represent a zero width no-break space at all (its purpose is completely different -- look up BOM if you're curious), which is why I find it equally unhelpful to use it by name instead of by codepoint. @John: You're right, it's a good idea to use a symbolic name (like a constant) instead of the codepoint directly. — Subalternate
@Cameron: The point of using the \N constant for ZWNBSP is that if you accidentally "mess up the order" you will get a SyntaxError immediately. The original purpose of ZWNBSP is now deprecated; "The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM". Unfortunately the unicodedata file doesn't include a mapping that would allow anything like u"\N{BYTE ORDER MARK}". — Dachau

You can use something like the following to remove the BOM while copying a file:

import os, codecs

def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open the source file in binary mode
        f = open(filename, 'rb')

        # read the first 4 bytes (enough to hold any BOM)
        header = f.read(4)

        # check if we have a BOM...
        bom_len = 0
        encodings = [(codecs.BOM_UTF32, 4),
                     (codecs.BOM_UTF16, 2),
                     (codecs.BOM_UTF8, 3)]

        # ... and skip the appropriate number of bytes
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        f.seek(bom_len)

        # copy the rest of the file
        contents = f.read()
        f.close()

        nf = open(newfilename, 'wb')  # must be binary *write* mode
        nf.write(contents)
        nf.close()
Plantain answered 16/3, 2010 at 17:11 Comment(5)
Hmm, don't you have to rewind the file after reading the first 4 bytes and before testing for BOMs? f.seek(0). — Piperonal
@Konrad I missed that, thanks for pointing out. This is not production code anyway:]. — Plantain
Looks good to me (with the seek(0) fix), but I've already got the entire file in memory when I'm trying to chop the BOM -- how efficient is contents[2:] (for example) in Python? Does it create a copy of the entire string? — Subalternate
I'd use this method if I was stripping the BOM while reading the file, but I'll be stripping the BOM with the file already in memory. Thanks for your reply though! — Subalternate
This answer also has problems. When reading a file, you need to check for FIVE (at least) possible BOMs, not three. See my answer. — Dachau
