I want to write data to files where a row from a CSV should look like this list (directly from the Python console):
row = ['\xef\xbb\xbft_11651497', 'http://kozbeszerzes.ceu.hu/entity/t/11651497.xml', "Szabolcs Mag '98 Kft.", 'ny\xc3\xadregyh\xc3\xa1za', 'ny\xc3\xadregyh\xc3\xa1za', '4400', 't\xc3\xbcnde utca 20.', 47.935175, 21.744975, u'Ny\xedregyh\xe1za', u'Borb\xe1nya', u'Szabolcs-Szatm\xe1r-Bereg', u'Ny\xedregyh\xe1zai', u'20', u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']
Python 2's csv module does not handle Unicode, but I have a UnicodeWriter wrapper:
import csv, cStringIO, codecs

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
However, these lines still produce the dreaded encoding error message below:
f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)
What is there to do? Thanks!
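For reference, the message in the traceback is what Python raises whenever a unicode string containing U+00FC is encoded with the ASCII codec. Python 2 does this implicitly when a unicode object reaches an API that expects a byte string. The sketch below reproduces it with a hypothetical value, the "XXII. kerület" field that appears UTF-8 encoded in the sample line quoted further down. Where exactly this happens in the asker's program is not shown in the thread, so this only illustrates the mechanism.

```python
# Hypothetical value: the "XXII. kerület" field as a unicode string.
value = u'XXII. ker\xfclet'

error = None
try:
    # Encoding with the ASCII codec fails on the first non-ASCII character.
    value.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# Index of the offending character (u'\xfc') within the string.
print(error.start)

# Encoding explicitly with UTF-8 succeeds and yields a byte string.
utf8_bytes = value.encode('utf-8')
```

The position reported here matches the "position 9" in the traceback, which suggests the failing value was a unicode string of this shape rather than one of the byte strings in the row.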
The csv module recipe indeed uses that dance. – Arjan
If row contains any data that is a str with characters in the 0x80 to 0xFF range then you'd get that exact exception. – Arjan
'\xef\xbb\xbft_23434419,http://kozbeszerzes.ceu.hu/entity/t/23434419.xml,FOREX MEDICAL Kft.,budapest,budapest,1221,kossuth lajos utca 21.,47.4270908,19.0383069,Budapest,XXII. ker\xc3\xbclet,Budapest,Budapest,21,Kossuth Lajos utca,Magyarorsz\xc3\xa1g,1221\r\n' – Arjan
print repr(row) will provide us with Python syntax to recreate the row. – Arjan
The row you posted here is such a representation; row[10] has a U+00FC codepoint but you have a Latin-1 codepoint instead. – Arjan
Install unicodecsv and use that to read your CSV (as well as write your new one); it'll handle the decoding for you. It uses the same wrapper you used for writing. – Arjan
u'' in it. – Arjan
address = ' '.join(map(unicode,row[-3::]))
data = urllib.urlopen("%s?address=%s&sensor=false&region=hu&language=hu&components=country:HU" % (url, address))
– Embolic
Use urllib.urlencode() to generate such query strings; encode the data to UTF-8 when you build that string. Incoming data: decode as early as possible. Outgoing data: postpone encoding until the last moment. So encode your address to UTF-8 at that point. – Arjan
address = ' '.join(map(unicode,row[-3::]))
params = urllib.urlencode({'address': address, 'sensor': 'false', 'region': 'hu', 'language': 'hu', 'components':'country:HU'})
data = urllib.urlopen("http://maps.googleapis.com/maps/api/geocode/json?%s" % params)
– Embolic
{'address': address.encode('utf8'), ... . – Arjan
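The advice in the comments (decode incoming data early, encode outgoing data late, and let urlencode build the query string) could be sketched roughly as follows. The helper names to_text and geocode_query and the shortened sample row are hypothetical, introduced only for illustration. The import shim lets the snippet run on both Python 2 and Python 3, and no request is actually sent.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: turn any cell into a text (unicode) object."""
    if isinstance(value, bytes):       # byte strings read from the CSV
        return value.decode(encoding)
    if isinstance(value, text_type):   # already decoded
        return value
    return text_type(value)            # numbers such as coordinates


def geocode_query(row):
    """Hypothetical helper: build the query string for one row."""
    cells = [to_text(cell) for cell in row]      # decode early
    address = u' '.join(cells[-3:])
    params = [
        ('address', address.encode('utf-8')),    # encode late
        ('sensor', 'false'),
        ('region', 'hu'),
        ('language', 'hu'),
        ('components', 'country:HU'),
    ]
    return urlencode(params)


# Shortened, made-up row mixing UTF-8 byte strings, floats and unicode objects.
row = [b't\xc3\xbcnde utca 20.', 47.935175,
       u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']
query = geocode_query(row)
print(query)
```

A list of pairs is used instead of a dict so that the parameter order in the resulting string is predictable. The resulting query could then be appended to the geocoding URL used in the thread.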