How to write a Unicode CSV in Python 2.7

I want to write data to files where a row of the CSV should look like this list (copied directly from the Python console):

row = ['\xef\xbb\xbft_11651497', 'http://kozbeszerzes.ceu.hu/entity/t/11651497.xml', "Szabolcs Mag '98 Kft.", 'ny\xc3\xadregyh\xc3\xa1za', 'ny\xc3\xadregyh\xc3\xa1za', '4400', 't\xc3\xbcnde utca 20.', 47.935175, 21.744975, u'Ny\xedregyh\xe1za', u'Borb\xe1nya', u'Szabolcs-Szatm\xe1r-Bereg', u'Ny\xedregyh\xe1zai', u'20', u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']

Python 2's csv module does not handle Unicode, so I am using the UnicodeWriter wrapper from the csv module documentation:

import csv, codecs, cStringIO
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

However, these lines still produce the dreaded encoding error message below:

f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)

What is there to do? Thanks!

Profusive asked 29/3, 2014 at 16:25 Comment(24)
Never mind, saw that the csv module recipe indeed uses that dance.Arjan
Are you passing in byte strings perhaps? If row contains any data that is a str with characters in the 0x80 to 0xFF range, then you'd get that exact exception.Arjan
I cannot reproduce your exception with the sample row provided, certainly. That is correctly written to the file as '\xef\xbb\xbft_23434419,http://kozbeszerzes.ceu.hu/entity/t/23434419.xml,FOREX MEDICAL Kft.,budapest,budapest,1221,kossuth lajos utca 21.,47.4270908,19.0383069,Budapest,XXII. ker\xc3\xbclet,Budapest,Budapest,21,Kossuth Lajos utca,Magyarorsz\xc3\xa1g,1221\r\n'Arjan
@MartijnPieters, Then copying the row from the Python console does not help; perhaps it already fixes encoding problems during printing. How else could I share with you the row that produces the error?Embolic
print repr(row) will provide us with Python-syntax to recreate the row.Arjan
However, the row you posted here is such a representation; row[10] has a U+00FC codepoint but you have a Latin-1 codepoint instead.Arjan
@MartijnPieters You are right, I can reproduce the failure with a row from the console (but not this one), I'll replace the example above. Sorry.Embolic
That row contains strings as I described.Arjan
That row contains the UTF-8 BOM at the start; it's as if you read back the data that you were meant to write, and didn't strip the BOM. You'd decode to Unicode first, not write that straight out again.Arjan
@MartijnPieters Thanks for bearing with me. So the data should work as Unicode already? How? It does fail in a similar way if I try using the standard Py2k csv.writer.Embolic
That is because you are mixing Unicode and bytestrings. Either use only unicode or only bytestrings.Arjan
@MartijnPieters, I see, of course, of course. The first half of the list comes from my own CSV, with bytestrings (right?). The second half has parsed JSON from Google Maps API responses. So maybe their encoding is different. Which is easier to change?Embolic
You want to write out a file using a consistent encoding. You ensure this by decoding to Unicode as early as possible. When reading your CSV, decode to Unicode at that point (perhaps stripping that UTF-8 BOM first).Arjan
@MartijnPieters Decoding line by line from my csv.reader, or even before that somehow? Or each element of each line as it is read? (I also need to look into what it means to strip the BOM.)Embolic
Install unicodecsv and use that to read your CSV (as well as write your new one); it'll handle the decoding for you. It uses the same wrapper you used for writing.Arjan
@MartijnPieters Now I call only unicodecsv.reader and unicodecsv.writer, but get the same error as before. Do I need to do more? Or is the API response (second half) the problem?Embolic
No, JSON responses are always Unicode. You are still mixing in bytestrings somewhere. Look at the row that produces the error; are there any strings without the u'' prefix in it?Arjan
@MartijnPieters Never mind, the new encoding broke other things for now. Sorry.Embolic
If you are concatenating Unicode and bytestrings or comparing them, then you still will get implicit conversions. Make sure that all your strings are Unicode objects.Arjan
@MartijnPieters I need to plug some of the unicodecsv.reader content into the API query. That can only be ASCII, right? A simple .encode('ascii') cannot encode the unicode string. How does this work?Embolic
@MartijnPieters Basically, I have these two lines: address = ' '.join(map(unicode,row[-3::])) data = urllib.urlopen("%s?address=%s&sensor=false&region=hu&language=hu&components=country:HU" % (url, address))Embolic
Use urllib.urlencode() to generate such query strings; encode the data to UTF-8 when you build that string. Incoming data - decode as early as possible. Outgoing data, postpone encoding until the last moment. So encode your address to UTF-8 at that point.Arjan
@MartijnPieters I am still getting a UnicodeEncodeError after these address = ' '.join(map(unicode,row[-3::])) params = urllib.urlencode({'address': address, 'sensor': 'false', 'region': 'hu', 'language': 'hu', 'components':'country:HU'}) data = urllib.urlopen("http://maps.googleapis.com/maps/api/geocode/json?%s" % params)Embolic
You didn't encode address to UTF-8 though, did you? {'address': address.encode('utf8'), ...Arjan
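
Pulling the advice from this comment thread together, a minimal sketch of the whole flow might look like the following (Python 2.7). It assumes the source CSV is UTF-8 encoded with a leading BOM; the filename and the column positions are placeholders taken from the comments above:

import urllib
import unicodecsv  # pip install unicodecsv

url = 'http://maps.googleapis.com/maps/api/geocode/json'

with open('source.csv', 'rb') as f:
    # unicodecsv decodes every cell to a unicode object for you
    for row in unicodecsv.reader(f, encoding='utf-8'):
        # strip the UTF-8 BOM (which decodes to U+FEFF) from the first cell
        row[0] = row[0].lstrip(u'\ufeff')
        # incoming data is now all unicode; postpone encoding until the
        # last moment, when the query string is built
        address = u' '.join(row[-3:])
        params = urllib.urlencode({
            'address': address.encode('utf-8'),
            'sensor': 'false',
            'region': 'hu',
            'language': 'hu',
            'components': 'country:HU',
        })
        data = urllib.urlopen('%s?%s' % (url, params))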

You are passing in bytestrings containing non-ASCII data, and these are being implicitly decoded to Unicode with the default ASCII codec at this line:

self.writer.writerow([unicode(s).encode("utf-8") for s in row])

unicode(bytestring) with data that cannot be decoded as ASCII fails:

>>> unicode('\xef\xbb\xbft_11651497')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
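
With an explicit codec the very same bytes decode fine, which shows the data itself is valid UTF-8 and only the implicit ASCII default is at fault:

>>> unicode('\xef\xbb\xbft_11651497', 'utf8')
u'\ufefft_11651497'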

Decode the data to Unicode before passing it to the writer:

row = [v.decode('utf8') if isinstance(v, str) else v for v in row]

This assumes that your bytestring values contain UTF-8 data. If you have a mix of encodings, decode to Unicode at the point of origin, where your program first sourced the data. You really want to do that anyway, no matter where the data came from, even for values that already happen to be UTF-8 encoded.
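
For this question that means decoding right at the csv.reader. A minimal sketch, assuming the source file is UTF-8 (the filename is a placeholder); the 'utf-8-sig' codec decodes UTF-8 and strips a leading BOM in one step:

import csv

with open('source.csv', 'rb') as f:
    for raw in csv.reader(f):
        # decode every bytestring cell to unicode as soon as it is read;
        # cells without a BOM decode exactly as plain UTF-8 would
        row = [cell.decode('utf-8-sig') for cell in raw]
        # from here on every value is unicode, and the unicode(s)
        # call inside UnicodeWriter is a harmless no-op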

Arjan answered 29/3, 2014 at 17:6 Comment(0)
