I'm using the following code to unzip and save a CSV file:
with gzip.open(filename_gz) as f:
file = open(filename, "w");
output = csv.writer(file, delimiter = ',')
output.writerows(csv.reader(f, dialect='excel', delimiter = ';'))
Everything seems to work, except for the fact that the first characters in the file are unexpected. Googling around seems to indicate that it is due to BOM in the file.
I've read that encoding the content in utf-8-sig should fix the issue. However, adding:
.read().encoding('utf-8-sig')
to f in csv.reader fails with:
File "ckan_gz_datastore.py", line 16, in <module>
output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
How can I remove the BOM and just save the content in correct utf-8?