I'm trying to read a CSV file whose header contains foreign characters, and I'm having a lot of problems with it.
First of all, I'm reading the file with a simple csv.reader:
filename = 'C:\\Users\\yuval\\Desktop\\בית ספר\\עבודג\\new\\resources\\mk'+ str(mkNum) + 'Data.csv'
raw_data = open(filename, 'rt', encoding="utf8")
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
header = x[0]
data = np.array(x[1:]).astype('float')
The variable header should be a list containing the file's column headers, but the list I get back is
['\ufeff"dayPart"', '"length"', '"ifPhoto"', '"ifVideo"', '"ifAlbum"', '"לא"', '"הוא"', '"בכל"', '"אותם"', '"זה"', '"הם"', '"כדי"', '"את"', '"יש"', '"לי"', '"היא"', '"אני"', '"רק"', '"להם"', '"על"', '"עם"', '"של"', '"המדינה"', '"כל"', '"גם"', '"הזה"', '"אם"', '"ישראל"', '"לכל"', '"מי"', '"ל"', '"אמסלם"', '"לנו"', '"אבל"', '"זו"', '"אין"', '"שבת"', '"שלום"', '"כ"', '"שלנו"', '"היום"', '"ומבורך"', '"ח"', '"דודי"', '"ר"', '"הפנים"', '"מה"', '"כי"', '"ה"', '"אחד"', '"ולא"', '"יותר"']
and I don't know why it adds the \ufeff to the first element, or why every element is wrapped in double quotation marks.
After that, I need to write to another CSV file and use foreign characters in its header as well. I tried to do it like this, but the characters came out as weird symbols:
with open('C:\\Users\\yuval\\Desktop\\בית ספר\\עבודג\\new\\variance reduction 1\\mk'+ str(mkNum) + 'Data.csv', 'w', newline='', encoding='utf8') as csvFile:
    csvWriter = csv.writer(csvFile, delimiter=',')
    csvWriter.writerow(newHeader)
Does anyone know how to fix this problem and work with UTF-8 encoding in a CSV file's header?
\ufeff is a Byte Order Mark that can often be found on Windows UTF-8 files, and it might be confusing csv. Try using utf-8-sig for the encoding. – Paid
[...] '\ufeff' (a character beyond U+FF in a u-less string). However, I challenge your claim that utf-8 is the "default for Python 3": for opening files, the default encoding is locale-dependent. The default encoding for source code is UTF-8, but that's irrelevant to the current question. – Terzetto
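The utf-8-sig suggestion from the comments can be sketched as follows. This is a minimal illustration, not a drop-in fix: the helper names read_header_and_rows and write_rows and the temporary demo files are hypothetical. It also leaves out the quoting=csv.QUOTE_NONE argument from the question, since that setting tells csv.reader to treat the quote characters as ordinary data, which would explain the quotation marks kept around every header name. Writing with utf-8-sig is likewise only a guess at the "weird symbols" problem: it assumes the output is being viewed in a spreadsheet program that relies on the BOM to detect UTF-8.

```python
import csv
import os
import tempfile


def read_header_and_rows(path):
    """Return (header, rows) from a CSV file, dropping a leading BOM if any."""
    # 'utf-8-sig' decodes UTF-8 and silently skips a BOM at the start of the file.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        # Default quoting (csv.QUOTE_MINIMAL) strips the surrounding double quotes.
        rows = list(csv.reader(f, delimiter=","))
    return rows[0], rows[1:]


def write_rows(path, header, rows):
    """Write header and rows as UTF-8 with a BOM at the start of the file."""
    # Assumption: the BOM is wanted so that spreadsheet programs detect UTF-8.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerow(header)
        writer.writerows(rows)


# Demo on throwaway files (hypothetical data shaped like the question's header).
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "in.csv")
dst = os.path.join(demo_dir, "out.csv")

with open(src, "w", encoding="utf-8-sig", newline="") as f:
    f.write('"dayPart","length","לא"\r\n1,2,3\r\n')

header, rows = read_header_and_rows(src)
write_rows(dst, header, rows)
header2, rows2 = read_header_and_rows(dst)
print(header, rows)
```

If the numeric rows are then needed as an array, np.array(rows).astype('float') from the question should work unchanged on the returned rows.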