Removing BOM from gzip'ed CSV in Python

  File "ckan_gz_datastore.py", line 16, in <module>
    output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
  File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

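First, you need to decode the file contents, not encode them. In Python 2, calling encode on a byte string first decodes it implicitly as ASCII, which is what raises the UnicodeDecodeError on the BOM byte 0xef.

Second, the csv module in Python 2.7 doesn't support unicode strings, so having decoded your data you need to encode it back to UTF-8.

Finally, csv.reader expects an iterable over the lines of the file, not one big string with line breaks in it.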
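To illustrate that last point, here is a small sketch with made-up sample data (the variable names are only for illustration). It compares handing csv.reader a whole string with handing it a list of lines:

```python
import csv

data = 'a;b\nc;d'  # hypothetical sample content

# Passing the string itself: csv.reader iterates over it one character
# at a time and treats each character as a separate "line".
from_string = list(csv.reader(data, delimiter=';'))

# Passing a list of lines: one row per line, as intended.
from_lines = list(csv.reader(data.splitlines(), delimiter=';'))

print(from_lines)  # [['a', 'b'], ['c', 'd']]
print(from_string == from_lines)  # False
```

The string version does not raise an error, it just produces rows that bear no resemblance to the file, which makes the mistake easy to miss.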

So:

csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
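For context, here is a minimal end-to-end sketch of the same idea. Note the assumptions: it is written for Python 3 rather than the 2.7 in the traceback, so the re-encode step is dropped (Python 3's csv module works on text), and the function name read_gz_csv and the sample data are invented for the example.

```python
import csv
import gzip
import io


def read_gz_csv(fileobj, delimiter=';'):
    """Return the rows of a gzip'ed UTF-8 CSV stream, dropping a leading BOM.

    Hypothetical helper for illustration; assumes Python 3.
    """
    with gzip.GzipFile(fileobj=fileobj, mode='rb') as gz:
        # 'utf-8-sig' decodes UTF-8 and strips one leading BOM if present.
        text = gz.read().decode('utf-8-sig')
    # csv.reader wants an iterable of lines, so split the text first.
    return list(csv.reader(text.splitlines(), dialect='excel', delimiter=delimiter))


# Build a small gzip'ed CSV in memory that starts with a UTF-8 BOM.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'\xef\xbb\xbfname;city\nAda;London\n')
buf.seek(0)

rows = read_gz_csv(buf)
print(rows)  # [['name', 'city'], ['Ada', 'London']]
```

Like the one-liner above, this reads the whole file into memory, which is fine for small files but may not suit very large ones.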

However, you might consider it simpler and more efficient just to remove the BOM manually:

import codecs

def remove_bom(line):
    # Strip a leading UTF-8 BOM from a byte string, if there is one.
    return line[len(codecs.BOM_UTF8):] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect='excel', delimiter=';')

That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:

def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        # Only the first line is checked for a BOM.
        yield remove_bom(firstline)
        # Every remaining line is passed through unchanged.
        for line in f:
            yield line
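
To make the difference between the two approaches concrete, here is a small self-contained sketch. The sample lines are made up, and the two helpers are repeated so the snippet runs on its own:

```python
import codecs


def remove_bom(line):
    # Strip a leading UTF-8 BOM from a byte string, if there is one.
    return line[len(codecs.BOM_UTF8):] if line.startswith(codecs.BOM_UTF8) else line


def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        yield remove_bom(firstline)
        for line in f:
            yield line


# Hypothetical input: a BOM on the first line and a stray one on the second.
lines = [codecs.BOM_UTF8 + b'a;b\n', codecs.BOM_UTF8 + b'c;d\n', b'e;f\n']

every_line = [remove_bom(line) for line in lines]
first_only = list(remove_bom_from_first(lines))

print(every_line == [b'a;b\n', b'c;d\n', b'e;f\n'])  # True
print(first_only[1].startswith(codecs.BOM_UTF8))  # True
```

The per-line version strips both BOMs, while the generator leaves the second line's BOM in place. An empty input is also handled: the generator simply yields nothing.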
