Detect Byte Order Mark (BOM) in Python

I've found lots of posts describing how to parse/ignore BOMs but can't find anything on how to simply output a true/false as to whether a file contains a BOM. Can anyone point me in the right direction to do this in Python?

Chatoyant answered 22/1, 2021 at 8:28 Comment(1)
It is complex, because in theory (according to Unicode) you should already know whether you have a BOM: a BOM defines the byte order if and only if the encoding does not specify it. As usual, Microsoft has a different idea of how to interpret standards, which adds confusion. The web has its own encoding detection algorithm (and has no need for a UTF-8 BOM, because the default is UTF-8, and HTML contains ASCII characters, so it is also easier to detect the byte order). The answer shows the standard encodings of the Unicode BOM, but such byte sequences are also legal in many other encodings. Facetious

The simple answer is: read the first 4 bytes and look at them.

with open("utf32le.file", "rb") as file:
    beginning = file.read(4)
    # The order of these if-statements is important
    # otherwise UTF32 LE may be detected as UTF16 LE as well
    if beginning == b'\x00\x00\xfe\xff':
        print("UTF-32 BE")
    elif beginning == b'\xff\xfe\x00\x00':
        print("UTF-32 LE")
    elif beginning[0:3] == b'\xef\xbb\xbf':
        print("UTF-8")
    elif beginning[0:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif beginning[0:2] == b'\xfe\xff':
        print("UTF-16 BE")
    else:
        print("Unknown or no BOM")
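
Since the question asks for a plain true/false result, the same check can be wrapped in a function. This is only a minimal sketch: the names `BOMS`, `detect_bom` and `has_bom` are made up for illustration, and it assumes the BOM constants from the standard `codecs` module are an acceptable replacement for the byte literals above.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
)


def detect_bom(path):
    """Return the name of the BOM at the start of the file, or None."""
    with open(path, "rb") as f:
        beginning = f.read(4)
    for bom, name in BOMS:
        if beginning.startswith(bom):
            return name
    return None


def has_bom(path):
    """Return True if the file starts with a known Unicode BOM."""
    return detect_bom(path) is not None
```

Calling `has_bom("utf32le.file")` would then give the boolean asked for, while `detect_bom` keeps the information about which BOM was found.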

The not so simple answer is:

There may be binary files that seem to have a BOM but are not text at all: their data just happens to start with bytes that look like one.

Other than that, you can typically treat text files without a BOM as UTF-8.
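
One way to reduce false positives from binary files is to check whether the rest of the data actually decodes with the encoding the BOM implies, falling back to UTF-8 when there is no BOM. The sketch below is a heuristic only, and `BOM_CODECS` and `guess_text_encoding` are hypothetical names. A clean decode does not prove the data is text (many binary blobs decode as UTF-16 without error), and the function expects the whole content in memory.

```python
import codecs

# Checked in this order so the UTF-32 LE BOM is not mistaken for UTF-16 LE.
BOM_CODECS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_text_encoding(data):
    """Return a codec name if data decodes cleanly as text, else None."""
    for bom, codec in BOM_CODECS:
        if data.startswith(bom):
            payload, candidate = data[len(bom):], codec
            break
    else:
        # No BOM found: fall back to UTF-8, as suggested above.
        payload, candidate = data, "utf-8"
    try:
        payload.decode(candidate)
    except UnicodeDecodeError:
        return None
    return candidate
```

A result of `None` means the bytes did not decode with the candidate encoding, which is a hint that an apparent BOM was accidental.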

Absently answered 22/1, 2021 at 8:47 Comment(2)
Note that Python now provides these constants and more in the codecs module. Opine
I'll also note that Python's built-in UTF-16 and UTF-32 codecs parse BOMs correctly. UTF-8 is the only one that needs special handling, and Python provides the utf-8-sig codec, which can be used as a drop-in replacement for utf-8 to handle the BOM transparently. Opine
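
The behaviour described in these comments can be illustrated with a short sketch that uses in-memory bytes instead of a real file. The variable names are arbitrary; the codec names `utf-8`, `utf-8-sig` and `utf-16` and the `codecs.BOM_*` constants are part of the standard library.

```python
import codecs

# The BOM constants in the codecs module are plain bytes objects.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"

raw_with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
as_utf8 = raw_with_bom.decode("utf-8")

# utf-8-sig strips the BOM if present and behaves like utf-8 otherwise.
as_utf8_sig = raw_with_bom.decode("utf-8-sig")
no_bom = b"hello".decode("utf-8-sig")

# The generic utf-16 codec reads the BOM to pick the byte order.
as_utf16 = (codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")).decode("utf-16")
```

The same codec name can be passed to `open()`, for example `open(path, encoding="utf-8-sig")`, to read a text file whether or not it starts with a UTF-8 BOM.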

© 2022 - 2024 — McMap. All rights reserved.