Detect Byte Order Mark (BOM) in Python


Asked 22/1, 2021 at 8:28 Answered 22/1, 2021 at 8:47

I've found lots of posts describing how to parse/ignore BOMs but can't find anything on how to simply output a true/false as to whether a file contains a BOM. Can anyone point me in the right direction to do this in Python?

Chatoyant answered 22/1, 2021 at 8:28 Comment(1)

It is complicated because, in theory (according to Unicode), you should already know whether you have a BOM: a BOM only defines the byte order, and only when the encoding doesn't already specify it. As usual, Microsoft has a different idea of how to interpret standards, which adds confusion. The web has its own encoding-detection algorithm, but it has no need for a BOM on UTF-8, because UTF-8 is the default (and HTML contains ASCII characters, so the byte order is also easier to detect). The answer lists the standard byte sequences of the Unicode BOM, but those sequences are also legal in many other encodings. – Facetious 22/1, 2021 at 9:13

The simple answer is: read the first 4 bytes and look at them.

with open("utf32le.file", "rb") as file:
    beginning = file.read(4)
    # The order of these if-statements is important
    # otherwise UTF32 LE may be detected as UTF16 LE as well
    if beginning == b'\x00\x00\xfe\xff':
        print("UTF-32 BE")
    elif beginning == b'\xff\xfe\x00\x00':
        print("UTF-32 LE")
    elif beginning[0:3] == b'\xef\xbb\xbf':
        print("UTF-8")
    elif beginning[0:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif beginning[0:2] == b'\xfe\xff':
        print("UTF-16 BE")
    else:
        print("Unknown or no BOM")

The not so simple answer is:

Some binary files may appear to start with a BOM, but they are still just binary files whose data happens to look like one.

Other than that, you can typically treat text files without a BOM as UTF-8.
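If you want a plain true/false, as the question asks, the check above can be wrapped in a function. This is only a sketch: `detect_bom` and `has_bom` are names made up for illustration, and the byte strings are taken from the standard `codecs` module rather than typed out by hand. The caveat about binary files still applies, so `True` only means "the file starts with bytes that look like a BOM".

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so checking UTF-16 first would misreport UTF-32 LE files.
_BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
)


def detect_bom(data: bytes):
    """Return the name of the BOM that `data` starts with, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def has_bom(path) -> bool:
    """Return True if the file at `path` starts with a known BOM."""
    with open(path, "rb") as file:
        # 4 bytes is enough to cover the longest BOM (UTF-32).
        return detect_bom(file.read(4)) is not None
```

Calling `has_bom("utf32le.file")` then gives the boolean directly, and `detect_bom` can be used on bytes you have already read.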

Absently answered 22/1, 2021 at 8:47 Comment(2)

Note that Python now provides these constants and more in the codecs module – Opine 2/3, 2022 at 15:36

I'll also note that Python's built-in UTF-16/32 codecs parse BOMs correctly. UTF-8 is the only one that needs special handling, and Python provides UTF-8-SIG which can be dropped in place of UTF-8 codecs to transparently handle the BOM. – Opine 3/3, 2022 at 14:4
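To illustrate the comment above, here is a small sketch of how the standard codecs treat a leading BOM. The sample bytes and variable names are invented for the demonstration.

```python
import codecs

# UTF-8 data with a BOM in front of it.
raw_utf8 = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as a U+FEFF character.
plain = raw_utf8.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present...
clean = raw_utf8.decode("utf-8-sig")

# ...and decodes normally when there is no BOM at all.
no_bom = b"hello".decode("utf-8-sig")

# The generic "utf-16" codec reads the BOM to pick the byte order
# and does not include it in the decoded text.
raw_utf16 = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
text16 = raw_utf16.decode("utf-16")
```

The same codec names work with `open(path, encoding="utf-8-sig")`, which is usually the simplest option when you only want the text and do not care whether a BOM was there.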
