How to determine the encoding of a CSV file?

Note: In general, identifying the original encoding of a text file is not a deterministic problem. If there is no metadata (e.g. an HTTP Content-Type header), you can only guess. There are tools and libraries that help you guess, and some of them do a pretty good job, but you can't be 100% sure. This is especially true if 8-bit encodings (like Latin-1, Windows CP1252 etc.) are involved.

But if you already know that the encoding must be either UTF-8 or UTF-16, then you're in a good situation.

UTF-16-encoded text files normally begin with a BOM (byte order mark); it is only omitted when the byte order is declared by some other means. You can use this to detect the encoding. There are two different "flavors" of UTF-16: Big Endian (BE) and Little Endian (LE). Since UTF-16 uses two-byte code units (16 bits), there are two ways to order the bytes: high byte first (BE) or low byte first (LE). You can tell which one you have from the BOM, i.e. by looking at the very first two bytes of the file:

FE FF → UTF-16 BE
FF FE → UTF-16 LE

For UTF-8, a BOM is not needed, since there is no byte order to mark; the Unicode standard permits it but neither requires nor recommends it. However, because many Windows applications long refused to recognise UTF-8 unless the file started with a BOM, a pseudo-standard "UTF-8 with BOM" emerged. If the BOM is present, it occupies the first three bytes of the file:

EF BB BF → UTF-8 with BOM

If your file starts with something different, then you have either BOM-less UTF-8 or some non-UTF encoding (Latin-1, CP1252...). Pure ASCII is also valid UTF-8, so an ASCII-only file falls into the first group.
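
The BOM checks above can be sketched in Python roughly as follows. This is a minimal illustration, not a complete detector: the function names `sniff_bom` and `looks_like_utf8` and the returned labels are made up for this example, and the code assumes (as this answer does) that the file is either UTF-8 or UTF-16. The strict-decode fallback is an extra step not described above; it can tell you that a BOM-less file is not UTF-8, but a successful decode is only strong evidence, not proof.

```python
import codecs
from typing import Optional


def sniff_bom(first_bytes: bytes) -> Optional[str]:
    """Return a label for the BOM at the start of first_bytes, or None.

    Assumes the file is UTF-8 or UTF-16. Note that FF FE 00 00 is also
    the UTF-32 LE BOM, which this sketch deliberately ignores.
    """
    if first_bytes.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "UTF-8 with BOM"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UTF-16 BE"
    if first_bytes.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "UTF-16 LE"
    return None


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def describe_file(path: str) -> str:
    """Hypothetical helper: read a whole file and describe its likely encoding."""
    with open(path, "rb") as f:  # binary mode, so no decoding happens yet
        data = f.read()
    label = sniff_bom(data[:3])  # the longest BOM checked here is 3 bytes
    if label is not None:
        return label
    if looks_like_utf8(data):
        return "probably UTF-8 without BOM"
    return "not UTF-8 or UTF-16 with BOM (some other encoding)"
```

Once you know the answer, Python's built-in "utf-8-sig" and "utf-16" codecs can be passed as the encoding to open(); both consume the BOM, so it does not end up in the first CSV field.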
