I'm writing a script that has to perform some operations on a CSV file, but I have no idea whether the file will be encoded as UTF-8 or UTF-16. How can I check if a given CSV file contains a UTF-16 BOM?
Note: In general, identifying the original encoding of a text file is not a deterministic problem. If there is no metadata (e.g. an HTML Content-Type header), you can only guess. There are tools and libraries out there that help you guess – and some of them do a pretty good job – but you can't be 100% sure. This is especially true if 8-bit encodings (like Latin-1, Windows CP1252, etc.) are involved.
But if you already know that the encoding must be either UTF-8 or UTF-16, then you're in a good situation.
UTF-16-encoded text files must always begin with a BOM. You can use this fact to detect it. There are two different "flavors" of UTF-16 – Big Endian (BE) and Little Endian (LE). Since UTF-16 uses two-byte words (16 bits), there are two ways to order them: high byte first (BE) or low byte first (LE). You can tell which one you have from the BOM, i.e. by looking at the very first two bytes of the file:
FE FF → UTF-16 BE
FF FE → UTF-16 LE
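For example, in Python you can open the file in binary mode and compare the first two bytes against those signatures. A minimal sketch (the file name "data.csv" is just a placeholder):

    with open("data.csv", "rb") as f:
        first_two = f.read(2)  # the BOM, if present, is the very first thing in the file

    if first_two == b"\xfe\xff":
        print("UTF-16 BE BOM found")
    elif first_two == b"\xff\xfe":
        print("UTF-16 LE BOM found")
    else:
        print("no UTF-16 BOM")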
For UTF-8, a BOM is not strictly needed – in fact, using it is actually non-standard. However, the fact that many Windows applications have long refused to recognise UTF-8 encoding unless it contains a BOM led to a pseudo-standard "UTF-8 with BOM". If the BOM is present, it occupies the first three bytes of the file:
EF BB BF → UTF-8 with BOM
If your file starts with something different, then you either have BOM-less UTF-8, or some non-UTF encoding (ASCII, Latin-1...).
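Putting it all together, here is a sketch of how you could pick an encoding for your CSV file based on the BOM. This is only one possible approach under the assumption stated above (the file is either UTF-8 or UTF-16); the function name and the file name are made up for the example. It uses the BOM constants from Python's standard codecs module, the "utf-16" codec (which reads the BOM itself and picks the right byte order) and "utf-8-sig" (which strips a UTF-8 BOM so it doesn't end up in your first field):

    import codecs
    import csv

    def detect_encoding(path):
        """Guess between UTF-16 and UTF-8 by inspecting the BOM, if any."""
        with open(path, "rb") as f:
            start = f.read(4)  # 4 bytes are enough to cover every BOM we care about

        if start.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
            return "utf-16"      # codec consumes the BOM and handles BE/LE itself
        if start.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"   # strips the BOM on read
        return "utf-8"           # no BOM: assume plain UTF-8 (ASCII is a subset)

    path = "data.csv"
    with open(path, encoding=detect_encoding(path), newline="") as f:
        for row in csv.reader(f):
            print(row)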