Python - Reading and writing csv files with utf-8 encoding

I'm trying to read a CSV file whose header contains foreign (Hebrew) characters, and I'm having a lot of problems with it.

First of all, I'm reading the file with a simple csv.reader:

filename = 'C:\\Users\\yuval\\Desktop\\בית ספר\\עבודג\\new\\resources\\mk'+ str(mkNum) + 'Data.csv'
raw_data = open(filename, 'rt', encoding="utf8")
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
header = x[0]
data = np.array(x[1:]).astype('float')

The variable header should be a list containing the file's column headers, but the list I get back is

['\ufeff"dayPart"', '"length"', '"ifPhoto"', '"ifVideo"', '"ifAlbum"', '"לא"', '"הוא"', '"בכל"', '"אותם"', '"זה"', '"הם"', '"כדי"', '"את"', '"יש"', '"לי"', '"היא"', '"אני"', '"רק"', '"להם"', '"על"', '"עם"', '"של"', '"המדינה"', '"כל"', '"גם"', '"הזה"', '"אם"', '"ישראל"', '"לכל"', '"מי"', '"ל"', '"אמסלם"', '"לנו"', '"אבל"', '"זו"', '"אין"', '"שבת"', '"שלום"', '"כ"', '"שלנו"', '"היום"', '"ומבורך"', '"ח"', '"דודי"', '"ר"', '"הפנים"', '"מה"', '"כי"', '"ה"', '"אחד"', '"ולא"', '"יותר"']

and I don't know why it adds \ufeff to the first element and wraps every field in double quotation marks.

After that, I need to write another CSV file whose header also contains foreign characters. I tried the following, but the characters came out as weird symbols:

with open('C:\\Users\\yuval\\Desktop\\בית ספר\\עבודג\\new\\variance reduction 1\\mk'+ str(mkNum) + 'Data.csv', 'w', newline='', encoding='utf8') as csvFile:
    csvWriter = csv.writer(csvFile, delimiter=',')
    csvWriter.writerow(newHeader)

Does anyone know how to fix this and work with UTF-8 encoded text in a CSV file's header?

Unfleshly answered 3/1, 2018 at 21:24 Comment(3)
Which version of Python are you using? UTF-8 should be the default for Python 3. - Discontinuous
The \ufeff is a Byte Order Mark, which is often found at the start of UTF-8 files created on Windows, and it might be confusing csv. Try using utf-8-sig for the encoding. - Paid
@Discontinuous It must be Python 3, since Python 2 cannot return '\ufeff' (a character beyond U+FF in a u-less string). However, I challenge your claim that UTF-8 is the "default for Python 3": for opening files, the default encoding is locale-dependent. The default encoding for source code is UTF-8, but that's irrelevant to the current question. - Terzetto

You report three separate problems. Some of this is guesswork, because there's not enough information to be sure, but you should try the following:

  1. input encoding: As suggested in the comments, try "utf-8-sig". This codec skips the Byte Order Mark (BOM) at the start of the input when decoding, so it no longer ends up in the first header field.

  2. double quotes: Among the csv parameters, you specify quoting=csv.QUOTE_NONE. This tells the csv library that the CSV table was written without using quotes (for escaping characters that could otherwise be mistaken for field or row separators). However, this is apparently not true, since the input has quotes around each field. Try csv.QUOTE_MINIMAL (the default) or csv.QUOTE_ALL instead.

  3. output encoding: You say the output contains "weird symbols". I suspect that the output is actually all right, but that you are viewing it with a tool which doesn't display UTF-8 text properly by default: many Windows applications (such as Excel) still assume UTF-16 or a localised 8-bit encoding such as CP-1255. As with problem 1, you should try the codec "utf-8-sig" when writing: it puts a BOM at the start of the file, which many viewers and editors understand as an encoding hint.
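Here is a minimal sketch that puts the three suggestions together. It is not your actual script: the helper names read_table and write_table, the file names in.csv and out.csv, and the sample row are made up for illustration, and the input is assumed to be a comma-separated UTF-8 file with a BOM and quoted fields, as your header output suggests.

```python
import codecs
import csv
import os
import tempfile


def read_table(path):
    """Return (header, rows) from a CSV file; all values are strings."""
    # "utf-8-sig" skips a leading BOM if there is one (suggestion 1).
    # newline="" is what the csv module documentation recommends.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        # No quoting=csv.QUOTE_NONE here: with the default setting the
        # reader strips the surrounding double quotes (suggestion 2).
        rows = list(csv.reader(f, delimiter=","))
    return rows[0], rows[1:]


def write_table(path, header, rows):
    """Write header and rows as UTF-8 with a leading BOM."""
    # On writing, "utf-8-sig" emits a BOM first, which many Windows
    # tools take as a hint that the file is UTF-8 (suggestion 3).
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL)
        writer.writerow(header)
        writer.writerows(rows)


# Build a small sample input (hypothetical data): BOM + quoted fields.
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "in.csv")
dst = os.path.join(demo_dir, "out.csv")
sample = '"dayPart","length","לא"\r\n"1","20","3"\r\n'
with open(src, "wb") as f:
    f.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

header, rows = read_table(src)
print(header)  # no \ufeff and no embedded double quotes

write_table(dst, header, rows)
```

If the viewer still shows garbage after this, the problem is probably in how that tool is configured rather than in the file. The values come back as strings, so the conversion to float with NumPy still has to be done afterwards, as in your code.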

Terzetto answered 3/1, 2018 at 22:7 Comment(1)
Thank you. I couldn't work out how to remove the BOM, and I'd never heard of the 'utf-8-sig' encoding! - Vendor
