Python utf-8-sig BOM in the middle of the file when appending to the end

I recently noticed that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:

>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')

The following text ends up in the file:

<BOM>123
<BOM>123

Isn't that a bug? It doesn't seem logical. Could anyone explain why it was done this way? Why isn't the BOM prepended only when the file doesn't exist and needs to be created?

Reconnoitre answered 18/4, 2014 at 12:39 Comment(1)
No, it's not a bug; that's perfectly expected behavior. The codec cannot detect how much was already written to a file. - Adrieneadrienne

No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.

Then there are other use-cases where the codec is used on a stream or bytestring (i.e. not with codecs.open()), where there is no file at all to test, or where the developer always wants to enforce a BOM at the start of the output.
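To illustrate the bytestring case, here is a minimal sketch that encodes a string in one shot. There is no file for the codec to inspect, so it simply emits the BOM first. The variable names are mine, purely for illustration:

```python
import codecs

# One-shot encoding: the codec has no file to look at,
# so it always puts the BOM in front of the encoded text.
encoded = u'123\n'.encode('utf-8-sig')
has_bom = encoded.startswith(codecs.BOM_UTF8)

# Decoding with the same codec strips a single leading BOM again.
decoded = encoded.decode('utf-8-sig')
```

Here `encoded` is the three BOM bytes followed by the UTF-8 text, and `decoded` is the original string without the BOM.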

Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
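A rough way to see why two separate appends give two BOMs: the "BOM already written" state lives in the encoder object, not in the file. The sketch below uses the incremental encoder from the codecs module directly. The two helper functions are hypothetical names of my own, and this only mimics the effect of reopening the file; it is not a trace of what codecs.open() does internally:

```python
import codecs

def encode_in_one_session(chunks):
    """Encode all chunks with a single utf-8-sig encoder instance."""
    encoder = codecs.getincrementalencoder('utf-8-sig')()
    return b''.join(encoder.encode(chunk) for chunk in chunks)

def encode_in_separate_sessions(chunks):
    """Encode each chunk with a fresh encoder, like reopening the file each time."""
    return b''.join(
        codecs.getincrementalencoder('utf-8-sig')().encode(chunk)
        for chunk in chunks
    )

chunks = [u'123\n', u'123\n']
one_session = encode_in_one_session(chunks)              # BOM appears once
separate_sessions = encode_in_separate_sessions(chunks)  # BOM before every chunk
```

A fresh encoder knows nothing about earlier output, so each one starts with a BOM of its own.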

If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:

import io

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')

I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
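For completeness, here is a sketch of the same idea wrapped in a small helper, followed by a demo run against a temporary file. Note that `append_text` is a hypothetical name, and that I swapped the `tell()` check for `os.path.getsize()` so the emptiness test happens before the file is opened. That is an assumption of convenience, not a requirement:

```python
import codecs
import io
import os
import tempfile

def append_text(path, text):
    """Append text as UTF-8, writing a BOM first only if the file is empty or missing."""
    # Hypothetical helper; checks the size up front instead of calling tell().
    is_empty = not os.path.exists(path) or os.path.getsize(path) == 0
    with io.open(path, 'a', encoding='utf-8') as outfh:
        if is_empty:
            outfh.write(u'\ufeff')  # U+FEFF encodes to the UTF-8 BOM
        outfh.write(text)

# Demo: append twice to a fresh file in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')
append_text(demo_path, u'123\n')
append_text(demo_path, u'123\n')

with io.open(demo_path, 'rb') as infh:
    demo_bytes = infh.read()
bom_count = demo_bytes.count(codecs.BOM_UTF8)
```

After both appends the file starts with one BOM and contains no second one in the middle.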

Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
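You can see the difference by encoding U+FEFF yourself. This is just a small illustration using the constants in the codecs module; the variable names are mine:

```python
import codecs

bom = u'\ufeff'

# UTF-8 has a single fixed byte sequence for U+FEFF.
utf8_bom = bom.encode('utf-8')

# UTF-16 yields different bytes depending on the byte order.
utf16_le_bom = bom.encode('utf-16-le')
utf16_be_bom = bom.encode('utf-16-be')

matches_constants = (
    utf8_bom == codecs.BOM_UTF8
    and utf16_le_bom == codecs.BOM_UTF16_LE
    and utf16_be_bom == codecs.BOM_UTF16_BE
)
```

The two UTF-16 results are byte-swapped versions of each other, which is exactly the information a decoder needs; the UTF-8 result carries no such information.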

The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. to tell UTF-8 apart from the legacy code pages).
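That kind of detection can be sketched in a few lines. To be clear, this is not how any particular Microsoft product does it; `guess_encoding` is a hypothetical helper, and the `cp1252` fallback is only an assumption standing in for "some legacy code page":

```python
import codecs

def guess_encoding(raw, fallback='cp1252'):
    """Return 'utf-8-sig' if raw starts with the UTF-8 BOM, else the fallback."""
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return fallback

with_bom = guess_encoding(codecs.BOM_UTF8 + b'123\n')
without_bom = guess_encoding(b'123\n')
```

Decoding with the returned 'utf-8-sig' also strips the BOM, so the caller gets clean text either way.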

Adrieneadrienne answered 18/4, 2014 at 12:44
