I have a CSV file and I want to find out its encoding. Is there a menu option in Microsoft Excel that can help me detect it, or do I need to use a programming language like C# or PHP to deduce it?
You can use Notepad++ to evaluate a file's encoding without writing any code. The detected encoding of the open file is displayed on the far right of the status bar at the bottom. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop-down.
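If you would rather check from code after all, the only encoding information a text file can carry itself is an optional byte-order mark (BOM) at its start. A minimal stdlib-only sketch (the helper name is my own; only BOM-carrying files can be identified this way):

```python
import codecs

# Well-known byte-order marks mapped to encoding names.
# Order matters: the UTF-32 BOMs begin with the same bytes as the UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(path):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    with open(path, 'rb') as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Files without a BOM (most Linux-produced CSVs, for instance) return None, so this is a first check, not a full detector.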
Even when Notepad++ reports an encoding (windows-1252 or something), it's using something like chardet under the hood. But what exactly? We have no idea. I wouldn't recommend this approach; it's a black box. – Auramine

In Linux systems, you can use the file command. It will give the correct encoding.
Sample:
file blah.csv
Output:
blah.csv: ISO-8859 text, with very long lines
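If your file implementation supports it (GNU file and recent BSD versions do), the --mime-encoding option prints just the charset, which is handier in scripts:

```shell
# -b suppresses the filename; --mime-encoding prints only the charset
file -b --mime-encoding blah.csv
```

This prints something like iso-8859-1 or utf-8 on its own, with no surrounding description.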
For my file it reports ASCII-text. However, maybe ASCII-text here stands for the system default, which is beyond its original meaning. – Depute

If you use Python, just use the print() function to check the encoding of a CSV file. For example:
with open('file_name.csv') as f:
    print(f)
The output is something like this:
<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>
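Note that the encoding shown here is whatever Python chose (or was told) when opening the file, not something detected from the file's bytes, as the comments point out. A small sketch of the pitfall, using a throwaway file name:

```python
import os

# Write a file whose bytes are really Latin-1 ('demo.csv' is a throwaway name).
with open('demo.csv', 'w', encoding='latin-1') as f:
    f.write('café\n')

# Reopen it while claiming it is UTF-16: the reported encoding is simply
# whatever was passed to open(); no bytes are inspected at all.
with open('demo.csv', encoding='utf-16') as f:
    print(f.encoding)  # utf-16

os.remove('demo.csv')
```

So printing the file object tells you how Python will decode the file, not what the file actually is.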
My file is actually utf-16, but the approach in this answer returns cp1252. Trying to open it with the Pandas CSV reader as cp1252 raises a ParserError, so something's amiss somewhere. – Mayapple

My file is ANSI encoded, but the above steps gave me cp1252 as the encoding, and trying to open the file with cp1252 as the encoding failed: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 54852: character maps to <undefined> – Enosis

This only reports the default of open(file, ..., encoding=None). In text mode, if encoding is not specified, the encoding used is platform dependent: locale.getpreferredencoding(False). The encoding of a CSV file is determined by the platform/program it was created on. If you don't know the context, chardet is a good start, but be aware that it's more than a decade old and has no support for emoticons etc. Using encoding='utf-8' is more robust nowadays. – Shupe

For me this shows cp1252, even though I can see in Notepad++ that the CSV file is utf-8. Only when doing with open('file_name.csv', encoding='utf-8') as f: does it in fact show utf-8, which is not helpful. – Comeon

The printed encoding is simply the one the file was opened with, which defaults to the platform encoding, e.g. cp1252. For more details, see the docs on open. – Secession

You can also use the Python chardet library:
# install the chardet library
!pip install chardet
# import the chardet library
import chardet
# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
    print(chardet.detect(file.read()))
Use chardet https://github.com/chardet/chardet (the documentation is short and easy to read).
Install Python, then pip install chardet, and finally use the chardetect command-line tool it provides.
I tested it under GB2312 and it's pretty accurate. (Make sure you have at least a few characters; a sample with only one character may easily fail.)
file is not reliable, as you can see.
Or you can execute in python console or in Jupyter Notebook:
data = open("file.csv", "r")
data
You will see information about the data object like this:
<_io.TextIOWrapper name='file.csv' mode='r' encoding='cp1250'>
As you can see, it contains the encoding information.
CSV files have no headers indicating the encoding.
You can only guess by looking at:
In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but it does not handle emoticons well.
icecream = '🍦'
import csv
with open('test.csv', 'w') as f:
    wf = csv.writer(f)
    wf.writerow(['ice cream', icecream])
import chardet
with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))
{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}
This gives UnicodeDecodeError while trying to read the file with this encoding.
The default encoding on Mac is UTF-8, so specifying it explicitly here wasn't strictly necessary, but on Windows it might be.
with open('test.csv', 'r', encoding='utf-8') as f:
    print(f.read())
ice cream,🍦
The file command also picked this up:
file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators
My advice in 2021, if the automatic detection goes wrong: try utf-8 before resorting to chardet.
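That advice can be sketched as a tiny helper (the function name and the Latin-1 fallback are my own choices, not from the answer):

```python
def read_text_best_effort(path):
    """Try UTF-8 first; fall back to Latin-1 only if UTF-8 fails."""
    try:
        # utf-8-sig behaves like utf-8 but also strips a leading BOM.
        with open(path, encoding='utf-8-sig') as f:
            return f.read(), 'utf-8'
    except UnicodeDecodeError:
        # latin-1 maps every possible byte value, so this never raises,
        # but the text may be mojibake if the true encoding was different.
        with open(path, encoding='latin-1') as f:
            return f.read(), 'latin-1'
```

Because invalid UTF-8 byte sequences almost never occur by accident, a successful UTF-8 decode is strong evidence the file really is UTF-8; the Latin-1 branch is only a lossless-bytes last resort.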
In Python, you can try...
from encodings.aliases import aliases
import pandas as pd

alias_values = set(aliases.values())
for encoding in alias_values:
    try:
        df = pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        pass
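A stdlib-only variant of the same idea decodes the raw bytes against a short, ordered candidate list instead of every alias (the function name and the default candidate list are my own):

```python
def guess_encoding(path, candidates=('utf-8-sig', 'utf-16', 'cp1252')):
    """Return the first candidate that decodes the whole file, else None.

    Order matters: put strict encodings first. A successful decode only
    proves the bytes are *valid* in that encoding, not that it is correct.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None
```

Leaving latin-1 out of the default list is deliberate: it accepts any byte sequence, so including it would make the function always "succeed".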
As mentioned by @3724913 (Jitender Kumar), you can use the file command (it also works in WSL on Windows). I was able to get the encoding information of a CSV file by executing file --exclude encoding blah.csv (an option found via man file), since plain file blah.csv doesn't show the encoding info on my system.
file only samples the file, so it is somewhat likely to tell you US-ASCII for a file which contains plain text and only a few non-ASCII characters. Similarly, it cannot reliably distinguish similar encodings, such as ISO 8859-1 vs. Windows code page 1252. – Antakiya

I just added a function to find the correct encoding and read the CSV at the given file path.
import os

import pandas as pd
import chardet


def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
    """
    Reads the CSV file located at path and returns it as a Pandas DataFrame.
    Only the first fraction of the file (given by size) is used to detect
    the encoding; the whole file is then read with that encoding.

    Args:
        path (str): The path to the CSV file.
        size (float): The fraction of the file to be used for detecting the
            encoding. Defaults to 0.10.

    Returns:
        pd.DataFrame: The CSV file as a Pandas DataFrame.

    Raises:
        UnicodeError: If the encoding cannot be detected with the initial
            size, the function retries with a larger size (increased by
            0.20) until the encoding can be detected or an error is raised.
    """
    try:
        byte_size = int(os.path.getsize(path) * size)
        with open(path, "rb") as rawdata:
            result = chardet.detect(rawdata.read(byte_size))
        return pd.read_csv(path, encoding=result["encoding"])
    except UnicodeError:
        return read_csv(path=path, size=size + 0.20)
Just add the encoding argument that matches the encoding of the file you're trying to open:
open('example.csv', encoding='UTF8')
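If the encoding is only a guess, open() also accepts an errors argument, so a single bad byte doesn't abort the whole read; errors='replace' swaps it for U+FFFD (the file name and sample bytes here are made up):

```python
import os

# Bytes that are not valid UTF-8 (0x81 is also unmapped in cp1252).
with open('example.csv', 'wb') as f:
    f.write(b'col1,col2\nabc,\x81def\n')

# Strict UTF-8 would raise UnicodeDecodeError on 0x81;
# errors='replace' substitutes U+FFFD and keeps going.
with open('example.csv', encoding='utf-8', errors='replace') as f:
    text = f.read()

print('\ufffd' in text)  # True

os.remove('example.csv')
```

This trades silent data loss for robustness, so it's best kept for exploratory reads rather than production imports.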
© 2022 - 2024 — McMap. All rights reserved.