How to check encoding of a CSV file

11

110

I have a CSV file and I wish to understand its encoding. Is there a menu option in Microsoft Excel that can help me detect it?

Or do I need to use a programming language like C# or PHP to deduce it?

Portcullis answered 12/5, 2016 at 4:7 Comment(2)
Possible duplicate of superuser.com/questions/280603/…Particia
Very closely related: How to determine encoding table of a text fileSecession
85

You can use Notepad++ to evaluate a file's encoding without needing to write code. The detected encoding of the open file is displayed on the bottom bar, at the far right. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop-down.

Lippizaner answered 12/5, 2016 at 4:38 Comment(9)
Huh? Is this using chardet or what? This feels like a bad answer.Auramine
Are you asking if Windows' built-in text editor, Notepad, uses a Python (or potentially JS) library? I'm certain it doesn't, since it'll be written in a native compiled language like C++. Also, text encoding is based on facts, not feelings, so if you have knowledge of why this is a bad answer, by all means raise it, but your feelings are irrelevant to the discussion.Lippizaner
Here's the thing: the CSV format doesn't have a way of identifying its character encoding. The format itself doesn't. So anything that opens a CSV has to guess what encoding was used to create the file. The question is, how does notepad do this? Chardet is a library that's used for this and it's based on the way browsers do it (IIRC, it was originally a C program). If notepad is guessing the encoding (as opposed to assuming all CSVs are windows-1252 or something), it's using something like chardet. But what? We have no idea. I wouldn't recommend this approach. It's a black box.Auramine
I'll add one more thing. Chardet (and other libraries) work by sniffing the contents of the file for binary patterns that would be common in certain encodings. If you do this programmatically, you can check the whole file for these kinds of patterns. If you do it via a GUI, it has to be fast, and you'll almost certainly limit yourself to the first X bytes of the file, which might not contain the pattern. So I recommend using a library for this purpose, which can read the entire file, not a GUI like Notepad, which needs to stay responsive for the user.Auramine
The OP specifically asked for a solution that doesn't require the use of a library. Yes, it's not a perfect solution given the unknowns involved, but we do know that 1. Notepad loads a lot of data, since it slows with larger files, and 2. Notepad is very mature and written by Microsoft, so it most likely does a pretty good job of detecting the encoding. On balance, I feel the solution is good enough and requires the least effort.Lippizaner
Loading the data into the GUI may or may not be the same thing as sniffing it for an encoding. The time taken there could be totally different. We just don't know how much of the file it's sniffing, or whether it's doing a good job at all. Do we know whether it just assumes a default or whether it's actually sniffing? The OP also suggested that a programming language might be needed; to do this correctly, it is.Auramine
Great suggestion. Thanks!Suffocate
I'm using a Mac, which Notepad++ is not compatible with. The OP doesn't mention that they're on Windows, and mentioning Microsoft Excel doesn't mean they're on Windows, because Excel is compatible with Mac. It should be mentioned in the answer that the solution is only meant for Windows.Johnathan
Just to belatedly clear up a couple of misconceptions in earlier comments: Notepad++ is not Notepad, but both of them probably use Microsoft's own library for character encoding detection under the hood. It was infamously exposed in the early 2000s when typing "Bush hid the facts" and reloading the file would trigger a bad heuristic. en.wikipedia.org/wiki/Bush_hid_the_factsAntakiya
82

On Linux systems, you can use the file command. It will give you its best guess at the encoding.

Sample:

file blah.csv

Output:

blah.csv: ISO-8859 text, with very long lines
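
If your version of file supports it, the --mime-encoding option prints just the detected character set:

file --mime-encoding blah.csv
blah.csv: iso-8859-1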
Sayres answered 24/4, 2018 at 17:4 Comment(3)
Caution: I don't think it's very reliable. I saved a plain CSV, checked the hex codes, and it was GB2312 (my system default encoding), but the file result is ASCII text. However, maybe ASCII text here stands for the system default, which would be beyond its original meaning.Depute
@Rick. Many many encodings are supersets of ASCII. From a reading of the Wikipedia article, GB2312/EUC-CN text which contains only ASCII characters is indistinguishable from ASCII.Foreshow
It is not "indistinguishable"; if you know some common patterns in ASCII-compatible encodings and/or the characters it's supposed to represent, it's generally possible to come up with a good guess in many real-world scenarios. The tricky corner case is when there is too little data to do any meaningful statistical analysis.Antakiya
68

If you use Python, just print() the file object to check the encoding of a CSV file. For example:

with open('file_name.csv') as f:
    print(f)

The output is something like this:

<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>
Locker answered 4/10, 2018 at 14:8 Comment(5)
Note: this may not be correct 100% of the time. I have a CSV file that can be opened correctly as utf-16, but the approach in this answer returns cp1252. Trying to open it with Pandas CSV reader as cp1252 returns a ParserError, so, something's amiss somewhere.Mayapple
This did not work for me. The file I had was ANSI encoded, but the above steps gave me cp1252 as encoding, which when I tried opening by mentioning cp1252 as encoding failed. UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 54852: character maps to <undefined>Enosis
From the python3 docs: open(file, ..., encoding=None). In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False). The encoding of a CSV file is determined by the platform/program it was created on. If you don't know the context, chardet is a good start, but know that it's more than a decade old and has no support for emoticons etc. Using encoding='utf-8' is more robust nowadays.Shupe
This answer seems wrong! It always shows cp1252, even though I can see in Notepad++ that the csv file is utf-8. Only when doing with open('file_name.csv', encoding='utf-8') as f: it does in fact show utf-8, which is not helpful.Comeon
This answer is totally wrong. The encoding Python uses to open the file does not necessarily correspond with the encoding of the data in the file; Python just uses the platform's default encoding. On Linux and I believe Mac, that's UTF-8, while on Windows, it's CP1252, hence the comments above saying this always returns cp1252. For more details, see the docs on open.Secession
25

You can also use the Python chardet library:

# install the chardet library first: run "pip install chardet" in a shell
# (or "!pip install chardet" in a Jupyter notebook)

# import the chardet library
import chardet

# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
    print(chardet.detect(file.read()))
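
chardet.detect() returns a dict with the guessed encoding and a confidence score; the output looks something like this:

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

For large files, reading everything into memory just to detect the encoding can be slow. A minimal sketch using chardet's incremental UniversalDetector, which stops reading as soon as it is confident (assuming the same test.csv as above):

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open("test.csv", 'rb') as file:
    # feed the detector chunk by chunk and stop early once it is confident
    for line in file:
        detector.feed(line)
        if detector.done:
            break
detector.close()
# detector.result has the same dict format as chardet.detect()
print(detector.result)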
Mcmath answered 4/4, 2020 at 8:3 Comment(0)
11

Use chardet https://github.com/chardet/chardet (documentation is short and easy to read).

Install Python, then run pip install chardet, and finally use the chardetect command-line tool that it provides.
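
For example (the exact output format may vary between chardet versions):

chardetect blah.csv
blah.csv: GB2312 with confidence 0.99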

I tested it under GB2312 and it's pretty accurate. (Make sure the file contains at least a few characters; a sample with only one character may easily fail.)

file is not as reliable; as mentioned in my comment on the file answer above, it reported my GB2312 file as plain ASCII text.

Depute answered 22/5, 2018 at 9:10 Comment(2)
Indeed, chardetect (the CLI tool from the python-chardet package shipped with most distros) is more reliable than file, but it can also take much longer (especially with huge files) to compute a result with a certain degree of confidenceAmmeter
If this takes too long you can sample part of the file to achieve a decent result, e.g. start by doing "head -n 500 bigfile.csv > smallerfile.csv" to get the top 500 lines to use as a sample.Cotta
5

Or you can execute this in a Python console or in a Jupyter notebook:

import csv
data = open("file.csv","r") 
data

You will see information about the data object like this:

<_io.TextIOWrapper name='file.csv' mode='r' encoding='cp1250'>

As you can see, it contains encoding information.

Skurnik answered 22/1, 2021 at 16:7 Comment(2)
Uh, no, that's useless. It simply reveals which encoding your system was set up to use by default (western Windows apparently in your case).Antakiya
Not only is this wrong, it also duplicates another answer from 2018 which has multiple comments explaining what's wrong with it.Antakiya
5

CSV files have no headers indicating the encoding.

You can only guess by looking at:

  • the platform / application the file was created on
  • the bytes in the file

In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but the lib does not handle emoticons well.

icecream = '🍦'

import csv

with open('test.csv', 'w') as f:
    wf = csv.writer(f)
    wf.writerow(['ice cream', icecream])


import chardet
with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))

{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}

This gives a UnicodeDecodeError when trying to read the file with that encoding.

The default encoding on Mac is UTF-8. It's included explicitly here but that wasn't even necessary... but on Windows it might be.

with open('test.csv', 'r', encoding='utf-8') as f:
    print(f.read())

ice cream,🍦

The file command also picked this up

file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators

My advice in 2021, if the automatic detection goes wrong: try UTF-8 before resorting to chardet.
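
A minimal sketch of that advice, using a hypothetical helper named read_text (not part of any library):

import chardet

def read_text(path):
    # Try UTF-8 first; it is by far the most common encoding for new files.
    try:
        with open(path, encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        # Fall back to chardet's best guess.
        with open(path, 'rb') as f:
            guess = chardet.detect(f.read())['encoding']
        with open(path, encoding=guess) as f:
            return f.read()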

Shupe answered 4/7, 2021 at 19:10 Comment(0)
1

In Python, you can try...

import pandas as pd
from encodings.aliases import aliases

# the set of codec names Python knows about
alias_values = set(aliases.values())

for encoding in alias_values:
    try:
        df = pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        # this codec could not decode (or parse) the file; try the next one
        pass
Mcmath answered 4/4, 2020 at 7:51 Comment(1)
Basically, this will select one of the available 8-bit encodings at random (or UTF-16). The fact that decoding succeeds only means that the file did not contain a byte which isn't permitted by the encoding. You are virtually guaranteed to get mojibake from any nontrivial input file.Antakiya
1

As @3724913 (Jitender Kumar) mentioned, you can use the file command (it also works in WSL on Windows). I was able to get encoding information for a CSV file by executing file --exclude encoding blah.csv, using info available in man file, as plain file blah.csv won't show the encoding info on my system.

Shirleneshirley answered 24/1, 2022 at 9:22 Comment(1)
file only samples the file, so it is somewhat likely to tell you US-ASCII for a file which contains plain text and only a few non-ASCII characters. Similarly it cannot reliably identify differences between similar encodings, such as ISO 8859-1 vs Windows code page 1252.Antakiya
0

I wrote a function that finds the correct encoding and reads the CSV at the given file path.

import os
import pandas as pd
import chardet
def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
 """
 Reads the CSV file located at path and returns it as a Pandas DataFrame.
 The encoding is detected with chardet from a sample of the file's bytes
 before the whole file is read.

 Args:
    path (str): The path to the CSV file.
    size (float): The fraction of the file to be used for detecting the 
    encoding. Defaults to 0.10.

 Returns:
    pd.DataFrame: The CSV file as a Pandas DataFrame.

 Raises:
    UnicodeError: If the encoding of the file cannot be detected with the 
    initial size, the function will retry with a larger size (increased by 
    0.20) until the encoding can be detected or an error is raised.
 """
 try:
    byte_size = int(os.path.getsize(path) * size)

    with open(path, "rb") as rawdata:
        result = chardet.detect(rawdata.read(byte_size))

    return pd.read_csv(path, encoding=result["encoding"])

 except UnicodeError:
    # give up once the whole file has been sampled without success,
    # otherwise retry with a bigger sample
    if size >= 1.0:
        raise
    return read_csv(path=path, size=size + 0.20)
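
Usage would then look like this (data.csv being a hypothetical path):

df = read_csv("data.csv")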
Stradivarius answered 14/2, 2023 at 9:21 Comment(1)
Tangentially the single-space indentation makes this rather hard to read. A common convention is to use four spaces for each indentation level.Antakiya
-3

Just add the encoding argument that matches the file you're trying to upload.

open('example.csv', encoding='UTF8')
Adorne answered 23/6, 2021 at 14:39 Comment(0)
