is unicode( codecs.BOM_UTF8, "utf8" ) necessary in Python 2.7/3?

Asked 11/11, 2011 at 15:20 Answered 11/11, 2011 at 17:45

Solved python unicode utf-8 byte-order-mark

In a code review I came across the following code:

# Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
# as a character.
# If untreated, it can prevent the page from validating or rendering 
# properly. 
bom = unicode( codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

This is in a function that passes a string to a Response object (Django or Flask).

Is this still a bug that needs this fix in Python 2.7 or 3? Something tells me it isn't, but I thought I'd ask because I don't know this problem very well.

I'm not sure where this came from, but I've seen it around the Internet, referenced sometimes in association with Jinja2 (which we are using).

Thanks for reading.

Byer answered 11/11, 2011 at 15:20 Comment(4)

If you encountered it in a code review, maybe you could ask the author where the code initially did come from and if there is some test case for it? Because I’ve never seen it before, and don’t think there is a real need for it (at least I never had a problem with not doing it). – Stockroom 11/11, 2011 at 15:30

@Stockroom there is a good chance the author was me. lol – Byer 3/2, 2015 at 18:0

Now that’s a late reply xD – Stockroom 3/2, 2015 at 18:4

@Stockroom Better late than never. :o) – Byer 3/2, 2015 at 18:31

The Unicode standard states that the character \ufeff has two distinct meanings. At the start of a data stream, it should be used as a byte-order and/or encoding signature, but elsewhere it should be interpreted as a zero-width non-breaking space.

So the code

bom = unicode(codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

isn't just removing the utf-8 encoding signature (aka BOM) - it's also removing any embedded zero-width non-breaking spaces.
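To make the difference concrete, here is a minimal sketch in Python 3 syntax (where `unicode(codecs.BOM_UTF8, "utf8")` becomes `codecs.BOM_UTF8.decode("utf-8")`). The sample string and the variable names are made up for illustration; the point is only to contrast a blanket `replace` with stripping a leading signature.

```python
import codecs

# Python 3 spelling of unicode(codecs.BOM_UTF8, "utf8"): the single character U+FEFF
bom = codecs.BOM_UTF8.decode("utf-8")

# Hypothetical sample: a leading signature plus one embedded U+FEFF
text = "\ufeffHello\ufeffWorld"

# What the reviewed code does: removes every U+FEFF, wherever it occurs
stripped_all = text.replace(bom, "")

# Alternative: drop U+FEFF only when it is the first character
stripped_leading = text[len(bom):] if text.startswith(bom) else text

print(repr(stripped_all))
print(repr(stripped_leading))
```

Whether the embedded character should survive depends on the data consumer, so neither variant is "the" correct one.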

Some earlier versions of Python did not have a variant of the "utf-8" codec that skips the BOM when reading data streams. Since this was inconsistent with the other Unicode codecs, a "utf-8-sig" codec was introduced in version 2.5, which does skip the BOM.
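As a rough illustration of the two codecs (Python 3 syntax; the input bytes are a made-up example), decoding with "utf-8" keeps the signature as a U+FEFF character, while "utf-8-sig" drops it and also accepts input that has no signature at all:

```python
import codecs

# Hypothetical input: UTF-8 bytes that start with the 0xEF 0xBB 0xBF signature
raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# Plain "utf-8" decodes the signature into a leading U+FEFF character
with_bom = raw.decode("utf-8")

# "utf-8-sig" skips the signature if it is present
without_bom = raw.decode("utf-8-sig")

# "utf-8-sig" also decodes data that has no signature
no_sig = "héllo".encode("utf-8").decode("utf-8-sig")

print(repr(with_bom), repr(without_bom), repr(no_sig))
```

If the bytes are decoded this way at the point where they enter the program, there should be no leading U+FEFF left to strip later.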

So it's possible the "Python bug" mentioned in the code comments relates to that.

However, it seems more likely that the "bug" relates to embedded \ufeff characters. But since the Unicode Standard clearly states they can be interpreted as legitimate characters, it is really up to the data consumer to decide how to treat them - and therefore not a bug in python.

Mendelssohn answered 11/11, 2011 at 17:45 Comment(1)

In Unicode 3.2, this usage as a zero-width non-breaking space is deprecated in favor of the "Word Joiner" character, U+2060. This allows U+FEFF to be only used as a BOM. [What should I do with U+FEFF in the middle of a file?] - unicode.org/faq/utf_bom.html#BOM – Supranatural 28/8, 2013 at 2:7

A BOM is a byte sequence that specifies which Unicode encoding is used.

The BOM tells the decoder how to transform bytes into Unicode (the same Unicode text can have several different binary representations).

It doesn't make any sense to try to put BOM inside a Unicode string.
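The decoder's use of the BOM can be sketched with UTF-16, where the same text has two byte orders (Python 3 syntax; the sample text is arbitrary). The generic "utf-16" codec reads the BOM to pick the byte order and does not include it in the result:

```python
import codecs

text = "A"  # arbitrary sample text

# The same character has two different UTF-16 byte representations
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The "utf-16" codec uses the BOM to choose the byte order, then discards it
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")

print(little, big, decoded_little, decoded_big)
```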

Chouinard answered 11/11, 2011 at 15:23 Comment(1)

the code posted by OP deletes the byte order mark, not puts it – Publicity 11/11, 2011 at 15:25
