'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
I am trying to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

According to the SEC the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, \n- terminated lines, with the first line containing the field names in lowercase.

My current code:

import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)

All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me?

Depalma answered 2/1, 2018 at 20:36 Comment(6)
Can we see the file you are using?Gymnast
Also, is this Python 2 or 3? The answer is very important, since the csv module is broken for non-ASCII on Python 2.Fumarole
I am using Python 3.6.0Depalma
Hmm... On rereading the error, I'm pretty sure the problem is your input file. The error indicates it is trying to read it as utf-8, so your input likely doesn't follow the format described. That said, the file you linked seems to follow it just fine (it's pure ASCII AFAICT; it uses some unusual ASCII control characters, but they're all in the ASCII range), so I'm not sure where you'd see a \xa0 byte. Is it possible you modified the file by accident before using it?Fumarole
see below the answer of Kopytok. if I change the encoding to 'windows-1252' it works perfect.Depalma
A side-note: You should be passing newline='' to open when working with CSV-like stuff. And the excel_tab dialect is wrong here; it assumes line endings are \r\n, when the file is \n endings. Defining your own dialect based off excel_tab would be an easy solution, just subclass it and set the class level variable lineterminator = '\n'Fumarole
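A minimal sketch combining the comment's suggestions (the dialect subclass and newline=''). The sample rows are invented for the demo, and the 'windows-1252' encoding comes from the accepted answer below:

```python
import csv

# Dialect based on excel_tab, but with \n line endings as the comment
# suggests; newline='' is passed to open() per the csv module docs.
class TabNewline(csv.excel_tab):
    lineterminator = '\n'

# Invented sample data standing in for txt.tsv.
with open('txt.tsv', 'w', encoding='windows-1252', newline='') as f:
    f.write('name\tvalue\nfoo\t1\nbar\t2\n')

with open('txt.tsv', newline='', encoding='windows-1252') as tsvfile:
    rows = list(csv.DictReader(tsvfile, dialect=TabNewline))

print(rows)  # [{'name': 'foo', 'value': '1'}, {'name': 'bar', 'value': '2'}]
```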

The file's encoding is 'windows-1252'. Use:

open('txt.tsv', encoding='windows-1252')
Jebel answered 2/1, 2018 at 21:0 Comment(7)
Thank you very much!! That works! May I ask you why it works with 'windows-1252' although the SEC states it is 'utf-8'?Depalma
Are you sure it's cp1252? The file I downloaded appeared to be ASCII. If it's not UTF-8, and not ASCII, it could be literally any single-byte-per-character ASCII superset and you'd only be able to guess at the encoding heuristically (it would successfully decode under any of them, but the results might be garbage).Fumarole
@Depalma Better ask SECJebel
@Fumarole encoding detector detected cp-1252 and the result seems to be legitJebel
This has the potential of producing invalid results. CP-1252 will happily decode anything (audio data, core dumps, zip archives) and pretend it's all valid text.Ellswerth
Casual inspection of my download of txt.tsv indicates no 0xa0 character at the offset indicated in the question, but plenty of 0xa0 characters which are apparently representing hard spaces, and 0xac characters in a position which indicates a currency indicator as well as 0xae which apparently is the ®‎ symbol. This is almost consistent with CP1252 or ISO-8859-1 (which of course are very similar), but the 0xac doesn't fit with either. Maybe see also cdn.rawgit.com/tripleee/8bit/master/encodings.html#ac (cough.)Ellswerth
In my case, I had a text file with Windows CRLF instead of Unix LF.Brasserie
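The near-identity mentioned in the comments above is easy to check in Python: cp1252 and ISO-8859-1 differ only in the 0x80–0x9F range, so the bytes discussed there decode identically under both:

```python
# The three bytes discussed in the comments above.
raw = bytes([0xa0, 0xac, 0xae])

cp1252 = raw.decode('windows-1252')
latin1 = raw.decode('latin-1')

print(cp1252)            # '\xa0¬®' (no-break space, not sign, ®)
print(cp1252 == latin1)  # True: these bytes fall outside the 0x80-0x9F
                         # range where the two encodings differ
```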

If you are working with Turkish data, I suggest this line:

import pandas as pd

df = pd.read_csv("text.txt", encoding='windows-1254')
Diorio answered 13/11, 2018 at 8:33 Comment(0)
import pandas as pd

ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')

Works fine for me, thanks.

Mirisola answered 3/3, 2019 at 21:11 Comment(0)

I also encountered this issue, and it worked with latin1 encoding; see the sample code for how to apply it in your codebase. Give it a try if the resolutions above don't work.

import pandas as pd

# 'missing' is a list of NA markers assumed to be defined elsewhere
df = pd.read_csv("../CSV_FILE.csv", na_values=missing, encoding='latin1')
Quadrennium answered 9/2, 2022 at 3:2 Comment(0)

If the input has a stray '\xa0', then it's not in UTF-8, full stop.

Yes, you have to either recode it to UTF-8 (see: iconv, recode commands, or a lot of text editors and IDEs can do it), or read it using an 8-bit encoding (as all the other answers suggest).

What you should ask yourself is - what is this character after all (0xa0 or 160)? Well, in many 8-bit encodings it's a non-breaking space (like   in HTML). For at least one DOS encoding it's an accented "a" character. That's why you need to look at the result of decoding it from the 8-bit encoding.

BTW, sometimes people say "UTF-8" when they mean "mostly ASCII, I guess". And if it was a non-breaking space, they weren't that far off:

In [1]: '\xa0'.encode()
Out[1]: b'\xc2\xa0'

One extra preceding '\xc2' byte would do the trick.
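To make the point concrete, here is what each decoding does with that byte:

```python
# A lone 0xa0 byte is invalid UTF-8...
try:
    b'\xa0'.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

# ...but with the extra 0xc2 byte it is the UTF-8 non-breaking space,
print(b'\xc2\xa0'.decode('utf-8') == '\xa0')     # True
# which is also what 0xa0 means in windows-1252.
print(b'\xa0'.decode('windows-1252') == '\xa0')  # True
```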

Charitycharivari answered 26/2, 2021 at 9:0 Comment(0)

I had the same error message for a .csv file, and this worked for me:

     import pandas as pd

     # Python has no 'ANSI' codec; on most Western Windows systems the
     # "ANSI" code page is cp1252 (alias 'windows-1252').
     df = pd.read_csv('Text.csv', encoding='cp1252')
Conni answered 29/1, 2019 at 11:41 Comment(0)

I was able to open a CSV file that gave me that error by re-encoding it: open the file in Notepad and save it as UTF-8. After that it opened without problems.
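The Notepad re-save can also be scripted. A sketch with invented file names and sample content, assuming the source is windows-1252 as in the accepted answer:

```python
# Invented sample file standing in for the problem CSV.
with open('input.csv', 'w', encoding='windows-1252') as f:
    f.write('caf\xe9;\xa0')  # é and a non-breaking space

# Read with the legacy encoding, write back as UTF-8.
with open('input.csv', encoding='windows-1252') as src:
    data = src.read()
with open('input_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(data)

print(open('input_utf8.csv', encoding='utf-8').read())  # 'café;\xa0'
```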

Deherrera answered 1/4 at 10:20 Comment(1)
This looks like a "thank you" answer for Tomasz Gandor's answer. Please don't add "thank you" as an answer. Once you have sufficient reputation, you will be able to vote up questions and answers that you found helpful. - From ReviewQuits

We have to handle files from several sources that could be in any encoding. We found TextIOWrapper with error handling very useful; in our case, errors='replace'. Docs: https://docs.python.org/3/library/io.html#io.TextIOWrapper
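A sketch of that approach; the byte string is invented demo data:

```python
import io

# Wrap a binary stream in TextIOWrapper with errors='replace' so
# undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
raw = io.BytesIO(b'hello\xa0world\n')
text = io.TextIOWrapper(raw, encoding='utf-8', errors='replace')
print(text.read())  # 'hello\ufffdworld\n'
```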

Galenism answered 17/5 at 19:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.