How to determine the encoding of a CSV file?

I'm writing a script that has to perform some operations on a CSV file, but I have no idea whether the file will be encoded as UTF-8 or UTF-16. How can I check whether a given CSV file contains a UTF-16 BOM?

Nomarch answered 11/2, 2019 at 17:45 Comment(3)
Sounds like it may be impossible; see "How to determine the encoding of text?" – Government
UTF-16 is not used much for exchanging data. Try opening the file in an editor (or a browser) and switching between encodings: when the data looks right, that is probably the correct encoding. If you see many 00 bytes, it is almost certainly UTF-16 (or another encoding with 16 or more bits per code unit). [A CSV file needs a comma, which is U+002C, so in this case there must be a 00 byte.] – Likeminded
It might be more straightforward to tell the sender that you only accept UTF-8 (or whatever). Or accept a file format where the character encoding is not separated from the file, like .xlsx. – Quibbling

Note: In general, identifying the original encoding of a text file is not a deterministic problem. If there is no metadata (e.g. an HTML content-type header), you can only guess. There are tools and libraries out there that help you guess – and some of them do a pretty good job – but you can't be 100% sure. This is especially true if 8-bit encodings (like Latin-1, Windows CP1252 etc.) are involved.

But if you already know that the encoding must be either UTF-8 or UTF-16, then you're in a good situation.

UTF-16-encoded text files normally begin with a BOM (byte order mark), and you can use this fact to detect the encoding. (Strictly speaking the BOM is optional, and BOM-less UTF-16 files do exist, but they are uncommon.) There are two different "flavors" of UTF-16 – big-endian (BE) and little-endian (LE). Since UTF-16 uses two-byte code units (16 bits), there are two ways to order the bytes: high byte first (BE) or low byte first (LE). You can tell which one you have from the BOM, i.e. by looking at the very first two bytes of the file:

  • FE FF → UTF-16 BE
  • FF FE → UTF-16 LE
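To make the byte-order difference concrete, here is a small Python sketch (Python is my choice here; the question does not name a language). It serializes the comma, U+002C, and the BOM character, U+FEFF, in both byte orders. The variable names are only illustrative, and `bytes.hex(" ")` assumes Python 3.8 or later.

```python
import codecs

# The same character, U+002C (comma), in both byte orders.
comma_le = ",".encode("utf-16-le")  # low byte first
comma_be = ",".encode("utf-16-be")  # high byte first
print(comma_le)  # b',\x00'
print(comma_be)  # b'\x00,'

# The BOM is just U+FEFF serialized the same way, which is why
# its byte order reveals the byte order of the rest of the file.
bom_le = "\ufeff".encode("utf-16-le")
bom_be = "\ufeff".encode("utf-16-be")
print(bom_le.hex(" "))  # ff fe
print(bom_be.hex(" "))  # fe ff

# The standard library exposes the same byte sequences as constants.
print(bom_le == codecs.BOM_UTF16_LE, bom_be == codecs.BOM_UTF16_BE)  # True True
```

This also shows why ASCII-heavy UTF-16 data is full of 00 bytes: every ASCII character carries one.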

For UTF-8, a BOM is not needed – the Unicode standard permits it but neither requires nor recommends it. However, because many Windows applications have long refused to recognise UTF-8 unless the file starts with a BOM, "UTF-8 with BOM" has become a pseudo-standard. If the BOM is present, it occupies the first three bytes of the file:

  • EF BB BF → UTF-8 with BOM

If your file starts with something different, then you either have BOM-less UTF-8, or some non-UTF encoding (ASCII, Latin-1...).
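The checks above could be sketched in Python roughly as follows. This is only a minimal sketch under the question's assumption that the file is either UTF-8 or UTF-16; `sniff_bom` and `read_rows` are hypothetical helper names, not part of any library. The `codecs.BOM_*` constants and the `utf-8-sig` and `utf-16` codecs are from the standard library.

```python
import codecs
import csv


def sniff_bom(path):
    """Return a Python codec name chosen from the file's first bytes.

    Hypothetical helper; assumes the file is UTF-8 or UTF-16.
    """
    with open(path, "rb") as f:
        head = f.read(3)  # the longest BOM checked here is 3 bytes
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    # Caveat: FF FE is also the start of a UTF-32 LE BOM, which is
    # ruled out here only by the stated assumption.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    return "utf-8"  # no BOM: treated as BOM-less UTF-8


def read_rows(path):
    """Read all CSV rows using the encoding suggested by the BOM."""
    with open(path, newline="", encoding=sniff_bom(path)) as f:
        return list(csv.reader(f))
```

Calling `read_rows("data.csv")` (with a file name of your own) then returns the rows as lists of strings, with the BOM already stripped. If the file might be in some other encoding, the final fallback to `"utf-8"` will raise a `UnicodeDecodeError` or produce wrong text, so treat that branch as a guess.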

Throaty answered 11/2, 2019 at 20:49 Comment(0)
