No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()
) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
Only use utf-8-sig
on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8
instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io
with io.open(filename, 'a', encoding='utf8') as outfh:
if outfh.tell() == 0:
# start of file
outfh.write(u'\ufeff')
I used the newer io.open()
instead of codecs.open()
; io
is the new I/O framework developed for Python 3, and is more robust than codecs
for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (e.g. not one of the legacy code pages).