How to display Chinese characters inside a pandas dataframe?

I can read a CSV file in which one column contains Chinese characters (the other columns are English text and numbers). However, the Chinese characters don't display correctly; see the screenshot below.

[screenshot of the DataFrame output with the Chinese characters rendered as garbled text]

I loaded the csv file with pd.read_csv().

Neither display(data06_16) nor data06_16.head() displays the Chinese characters correctly.

I tried adding the following lines to my .bash_profile:

export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

but neither helps.

I have also tried passing an encoding argument to pd.read_csv():

pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')

None of these work either.

How can I display the Chinese characters properly?

Bigler answered 3/9, 2016 at 14:34 Comment(2)
Did you try a codec for Chinese text, say encoding='gb2312'? – Euhemerism
Thanks. I tried the encoding you suggested, but it returned an error: UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence – Bigler

I just remembered that the source dataset was created with encoding='GBK', so I tried again with

data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")

Now, I can see all the Chinese characters.
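
If you don't know the source encoding up front, a byte-level detector such as the third-party chardet package can suggest a starting point (it may report GB2312 for files that actually need the broader GBK or GB18030 codec, so treat the guess as a hint, not a guarantee). A minimal sketch, reusing the same file path as above:

import chardet
import pandas as pd

# Guess the encoding from a sample of the raw bytes.
with open("../data/stocks1542monthly.csv", "rb") as f:
    guess = chardet.detect(f.read(100_000))
print(guess["encoding"], guess["confidence"])

data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding=guess["encoding"])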

Thanks guys!

Bigler answered 3/9, 2016 at 23:37 Comment(0)

Try this

df = pd.read_csv(path, engine='python', encoding='utf-8-sig')
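
For context: 'utf-8-sig' behaves like 'utf-8' but also strips a leading byte order mark (BOM), which Excel often writes when saving CSV as UTF-8. A quick way to check whether that is your situation (the file name below is just the one from the question):

# The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
with open("data.csv", "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")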
Yiyid answered 20/3, 2019 at 1:19 Comment(0)

I see three possible options here:

1) You can try this, handing the decoded file object straight to pandas:

import codecs
import pandas as pd
with codecs.open("testdata.csv", "r", "utf-8") as f:
    x = pd.read_csv(f)

2) Another possibility is simply this:

import pandas as pd
# read_csv already returns a DataFrame, so there is no need to wrap it in pd.DataFrame.
df = pd.read_csv('testdata.csv', encoding='utf-8')

3) Maybe you should convert your csv file to UTF-8 before importing it with Python (for example in Notepad++)? That can work for a one-time import, though not for an automated process, of course; for a scripted version, see the sketch after this list.
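
If you do need to automate option 3, here is a minimal re-encoding sketch; the file names and the GBK source encoding are assumptions (the accepted answer found the file to be GBK-encoded):

# Read the file with its (assumed) source encoding and write it back out as UTF-8.
with open("testdata.csv", "r", encoding="gbk") as src:
    text = src.read()
with open("testdata_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)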

Narvaez answered 3/9, 2016 at 18:55 Comment(0)

A non-Python answer. I ran into this problem this afternoon and found that importing data from a CSV in Excel shows a long list of encoding names, so you can experiment there and see which one fits. For instance, I found that in Excel both gb2312 and gb18030 convert the data nicely from CSV to XLSX, but only gb18030 works for me in Python.

pd.read_csv(in_path + 'XXX.csv', encoding='gb18030')

Anyway, this is not about how to import a CSV in Python, but rather about finding which encodings are worth trying. [screenshot of Excel's CSV import dialog showing the encoding drop-down]
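
If you would rather stay in Python, you can get a similar try-and-see effect by looping over a list of candidate encodings and keeping the first one that decodes; the file path and the candidate list below are only assumptions:

import pandas as pd

candidates = ["utf-8", "utf-8-sig", "gbk", "gb18030", "big5"]
for enc in candidates:
    try:
        df = pd.read_csv("XXX.csv", encoding=enc)
        print("parsed with", enc)
        break
    except UnicodeDecodeError:
        continue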

Nashner answered 12/8, 2021 at 13:8 Comment(2)
Hi, may I ask how you reach this step in Excel? Thanks. – Zulemazullo
Searching for "import csv to excel with encoding option" should yield fruitful results. – Nashner

You load a dataset and get some strange characters. Example:

'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap HS01(铜金色礼盒版)'

In my case, I know the strange characters are supposed to be Chinese, so I can work out that whoever sent me the data encoded it as UTF-8, but it was then read back as 'ISO-8859-1', producing the mojibake above.

So as a first step I encode the string back to ISO-8859-1 bytes, then decode those bytes as UTF-8. My lines are:

_encoding = 'ISO-8859-1'
# Turn the mojibake back into its raw bytes, then decode those bytes as UTF-8.
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')

Then my output is :

"'森Dyson Airwrap HS01礼'"

This works for me, though I don't claim to fully understand what happens under the hood; note that some characters are lost because the 'ignore' error handler drops mojibake characters that have no ISO-8859-1 equivalent, which corrupts the surrounding UTF-8 byte sequences. Feel free to add details if you know more.

Bonus: I'll try to detect when a string is in this garbled format, because some of my entries are in Chinese while others are in English.

EDIT: The bonus is unnecessary. I just apply a lambda to my column to encode and decode without caring about the format, so I fix the encoding after loading the DataFrame:

_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x: x.encode(_encoding, 'ignore').decode(_decoding, 'ignore'))
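
As an alternative to the manual round-trip, the third-party ftfy package is designed to repair exactly this kind of mojibake (though characters that were already replaced by '�' are gone for good). A minimal sketch on the garbled string from above:

import ftfy  # third-party package: pip install ftfy

garbled = '戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap HS01(铜金色礼盒版)'
print(ftfy.fix_text(garbled))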
Eustasius answered 22/1, 2022 at 17:4 Comment(0)
