I've found lots of posts describing how to parse/ignore BOMs but can't find anything on how to simply output a true/false as to whether a file contains a BOM. Can anyone point me in the right direction to do this in Python?
Detect Byte Order Mark (BOM) in Python
It is complicated because, in theory (according to Unicode), you should already know whether you have a BOM: a BOM defines the byte order if and only if the encoding doesn't specify it. As usual, Microsoft has its own ideas about how to interpret standards, which adds confusion. The web has its own encoding-detection algorithm (but there is no need for a, strictly speaking wrong, BOM on UTF-8 there, because the default is UTF-8, and HTML starts with ASCII characters, so the byte order is also easier to detect). The answer lists the standard encodings of the Unicode BOM (but such byte sequences are also legal in many other encodings). –
Facetious
The simple answer is: read the first 4 bytes and look at them.
with open("utf32le.file", "rb") as file:
    beginning = file.read(4)
    # The order of these if-statements is important,
    # otherwise UTF-32 LE may be detected as UTF-16 LE as well
    if beginning == b'\x00\x00\xfe\xff':
        print("UTF-32 BE")
    elif beginning == b'\xff\xfe\x00\x00':
        print("UTF-32 LE")
    elif beginning[0:3] == b'\xef\xbb\xbf':
        print("UTF-8")
    elif beginning[0:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif beginning[0:2] == b'\xfe\xff':
        print("UTF-16 BE")
    else:
        print("Unknown or no BOM")
The not so simple answer is:
There may be binary files that seem to start with a BOM but are really just binary files whose data happens to look like one.
Other than that, you can typically treat text files without a BOM as UTF-8.
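If all you want is the plain true/false from the question, the same check can be wrapped in a function. This is only a minimal sketch: it uses the BOM constants from the standard codecs module, and the names detect_bom and has_bom are made up for illustration. The caveat about binary files that merely look like they start with a BOM still applies.

```python
import codecs

# Check the 4-byte BOMs before the 2-byte ones: the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
)


def detect_bom(data):
    """Return the name of the BOM that `data` starts with, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def has_bom(path):
    """Return True if the file at `path` starts with a known BOM."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4)) is not None
```

Calling has_bom("utf32le.file") then gives True or False, and detect_bom can be used on its own when you also want to know which BOM was found.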
Note that Python now provides these constants and more in the codecs module –
Opine
I'll also note that Python's built-in UTF-16/32 codecs parse BOMs correctly. UTF-8 is the only one that needs special handling, and Python provides UTF-8-SIG which can be dropped in place of UTF-8 codecs to transparently handle the BOM. –
Opine
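To illustrate the comment above, here is a small sketch of how the codecs behave when decoding. The sample bytes are made up for the example; only the codec names "utf-8", "utf-8-sig" and "utf-16" come from Python itself.

```python
import codecs

# UTF-8 bytes with a BOM in front (sample data for illustration).
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain utf-8 keeps the BOM as a U+FEFF character at the start.
with_plain = data.decode("utf-8")
# utf-8-sig strips the BOM if present ...
with_sig = data.decode("utf-8-sig")
# ... and also accepts input that has no BOM at all.
no_bom = b"hello".decode("utf-8-sig")

print(repr(with_plain))  # '\ufeffhello'
print(repr(with_sig))    # 'hello'
print(repr(no_bom))      # 'hello'

# The generic utf-16 codec writes a BOM when encoding and consumes it
# again when decoding, so no special handling is needed there.
utf16 = "hello".encode("utf-16")
round_trip = utf16.decode("utf-16")
print(repr(round_trip))  # 'hello'
```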