I am trying to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)
According to the SEC the data set is provided in a single encoding, as follows:
Tab Delimited Value (.txt): utf-8, tab-delimited, \n- terminated lines, with the first line containing the field names in lowercase.
My current code:
import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)
All attempts ended with the following error message:
'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
I am a bit lost. Can anyone help me?
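One way to narrow this down is to scan the raw bytes and report where decoding fails. The following is only a minimal sketch: the helper name `find_undecodable_lines` and its `limit` parameter are made up for illustration and are not part of any library. Note that the position in the error message may be relative to an internal read buffer rather than to the start of the file, so searching line by line in binary mode is more reliable than seeking to byte 4276.

```python
def find_undecodable_lines(path, encoding="utf-8", limit=5):
    """Return (line_number, offset_in_line, bad_bytes) for lines that fail to decode.

    Hypothetical helper: reads the file in binary mode so no decoding
    happens until we explicitly try it on each line.
    """
    problems = []
    with open(path, "rb") as f:
        # Iterating a binary file yields one b"\n"-terminated line at a time.
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start / exc.end delimit the offending bytes within `raw`.
                problems.append((lineno, exc.start, raw[exc.start:exc.end]))
                if len(problems) >= limit:
                    break
    return problems
```

Calling `find_undecodable_lines('txt.tsv')` returns an empty list if the whole file decodes as UTF-8. If it reports the byte `b'\xa0'`, that value is a non-breaking space in Latin-1 and Windows-1252, which would suggest the copy being read is not UTF-8 encoded.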
csv module is broken for non-ASCII on Python 2. – Fumarole
… utf-8, so your input likely doesn't follow the format described. That said, the file you linked seems to follow it just fine (it's pure ASCII AFAICT; it uses some unusual ASCII control characters, but they're all in the ASCII range), so I'm not sure where you'd see a \xa0 byte. Is it possible you modified the file by accident before using it? – Fumarole
… newline='' to open when working with CSV-like stuff. And the excel_tab dialect is wrong here; it assumes line endings are \r\n, when the file is \n endings. Defining your own dialect based off excel_tab would be an easy solution, just subclass it and set the class level variable lineterminator = '\n' – Fumarole
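The suggestions from the comments (pass newline='' to open, and subclass excel_tab with lineterminator = '\n') could be combined roughly as follows. This is a sketch, not a confirmed fix: the names `UnixTab` and `read_tsv` and the `encoding` parameter are assumptions added for illustration. Per the csv module documentation, the reader recognises both '\r' and '\n' as line endings and ignores lineterminator, so the custom dialect mainly matters if the same dialect is later used for writing.

```python
import csv


class UnixTab(csv.excel_tab):
    # Same as excel_tab, but with "\n" line endings instead of "\r\n".
    # (Hypothetical class name; only csv.excel_tab comes from the stdlib.)
    lineterminator = "\n"


def read_tsv(path, encoding="utf-8"):
    """Yield each row of a tab-delimited file as a dict keyed by the header line."""
    # newline="" lets the csv module handle line endings itself;
    # the explicit encoding avoids depending on the platform default.
    with open(path, newline="", encoding=encoding) as tsvfile:
        reader = csv.DictReader(tsvfile, dialect=UnixTab)
        for row in reader:
            yield row
```

Usage would be `for row in read_tsv('txt.tsv'): print(row)`. If the local copy of the file really does contain a 0xa0 byte, passing `encoding='cp1252'` or `encoding='latin-1'` is one thing to try, on the assumption that the file was saved in one of those encodings.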