Python3 UnicodeDecodeError with readlines() method
A

3

46

I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python 3 and tweepy in a virtualenv on my shared server space. This is the part of the code that seems to cause trouble:

#!/foo/env/bin/python3

import re
import tweepy, time, sys

argfile = str(sys.argv[1])

filename=open(argfile, 'r')
f=filename.readlines()
filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

The error specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

Anear answered 27/1, 2016 at 4:5 Comment(4)
See this post; it has two really helpful answers you should try. - Unstrap
I used encoding='iso-8859-1', and it solved my problem. - Rom
@hsinghal: ISO-8859-1 (aka latin-1) will always work, but it's often wrong. The problem is that it can decode any byte from any encoding, but if the original text isn't really latin-1, it's going to decode to garbage. You need to know the real encoding, not just guess; UTF-8 is mostly self-checking, so it's unlikely to decode binary gibberish, but latin-1 will happily decode binary gibberish to text gibberish and never whisper a word of complaint. - Quintilla
@Quintilla Thank you for your explanation. It adds to my current knowledge. - Rom
D
72

I think the best answer (in Python 3) is to use the errors= parameter:

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>

Note that errors= can be 'replace' or 'ignore' (among other handlers). Here's what 'ignore' looks like:

>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
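
The same handlers can be compared without touching a file by calling bytes.decode directly. This is only a sketch: it reuses the first bytes from the proof above, and it adds 'backslashreplace', which is not mentioned in the answer and, as far as I know, is only supported for decoding from Python 3.5 onward.

```python
# 0xe5 announces a multi-byte UTF-8 sequence, but 'a' is not a continuation byte
raw = b'\xe5abc'

handlers = ['replace', 'ignore', 'backslashreplace']
results = {name: raw.decode('utf-8', errors=name) for name in handlers}

for name, text in results.items():
    # 'replace' inserts U+FFFD, 'ignore' drops the byte,
    # 'backslashreplace' keeps it visible as an escape sequence
    print(name, repr(text))
```

All three handlers lose or alter the original byte, so they hide the problem rather than fix it; they are most useful when a few damaged characters are acceptable.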
Dazzle answered 14/1, 2017 at 17:27 Comment(0)
Q
24

Your default encoding appears to be ASCII, whereas the input is more than likely UTF-8. When the decoder hits non-ASCII bytes in the input, it raises the exception. It's not so much that readlines itself is responsible for the problem; rather, it triggers the read and decode, and the decode is what fails.
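
To see which encoding open() falls back to when no encoding= is given, a check along these lines may help. It is a minimal sketch, assuming the documented behavior that text-mode open() uses locale.getpreferredencoding(False) by default. On a shared server with a bare C/POSIX locale, older Python 3 releases can report an ASCII codec here (often spelled 'ANSI_X3.4-1968').

```python
import locale
import sys

# Encoding that text-mode open() falls back to when encoding= is omitted
default_text_encoding = locale.getpreferredencoding(False)
print("open() default encoding:", default_text_encoding)

# Encoding used for filenames, printed only for comparison
print("filesystem encoding:", sys.getfilesystemencoding())
```

If the first line prints an ASCII variant, passing encoding= explicitly avoids depending on the server's locale settings.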

It's an easy fix, though: the built-in open in Python 3 lets you provide the known encoding of an input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading as str (rather than the significantly different bytes objects that hold raw binary data), while letting Python do the work of converting from raw disk bytes to true text data:

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()

If the file is in some other encoding, you'd change encoding='utf-8' to the appropriate argument. Note that while some people will tell you to "just use 'latin-1' here if 'utf-8' doesn't work":

  1. That's often wrong (modern text editors tend to produce UTF-8 or UTF-16, with latin-1 being much less common; frankly, you're more likely to see Microsoft's 'latin-1' variant, 'cp1252', that's mostly the same but remaps some characters to support stuff like smart quotes), and
  2. Unlike the UTF encodings, the various byte-per-character ASCII superset encodings (including 'latin-1', 'cp1252', 'cp437', and many others) are not self-checking; if the data isn't in the encoding specified, they'll still happily decode it, it will just produce gibberish for stuff above the ASCII range.

In short, if your data isn't a UTF encoding (or one of the rare non-UTF self-checking encodings), you need to know the encoding used, or you're stuck guessing and checking the result to see if it makes sense (and for stuff like a source that might be latin-1 or cp1252, you'll never be sure unless it eventually contains a cp1252-specific character).
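
The difference between a self-checking encoding and one that accepts anything can be seen with a single accented word. This is an illustrative sketch; the sample string is made up for the demonstration.

```python
text = 'caf\u00e9'                     # "cafe" with an acute accent on the e
utf8_bytes = text.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = text.encode('latin-1')  # b'caf\xe9'

# latin-1 maps every byte to a character, so a wrong guess silently yields mojibake
mojibake = utf8_bytes.decode('latin-1')
print(repr(mojibake))

# UTF-8 rejects byte sequences that are not valid UTF-8, so a wrong guess is detected
try:
    latin1_bytes.decode('utf-8')
    detected = False
except UnicodeDecodeError:
    detected = True
print("wrong guess detected by UTF-8:", detected)
```

The latin-1 decode succeeds but turns one character into two wrong ones, while the UTF-8 decode fails loudly, and that loud failure is what makes UTF-8 the safer first attempt.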

Quintilla answered 27/1, 2016 at 17:24 Comment(5)
I like the simplicity of this solution, but I just tried it in Python 3.6.8 and it fails. - Standley
@M.H.: It will work on UTF-8 data. If it's not UTF-8, you need to figure out what it is. This will work just as well on 3.6.8 as on any other 3.x release (and on Python 2.6+ for that matter, if you do from io import open to replace the Py2 open with the Py3 version). If you don't know the encoding though, you're stuck guessing. - Quintilla
@r_e_cur: I rejected your edit because, even if your case happened to work with latin-1, latin-1 is a trap, and should not be anyone's first (or second, or third) attempt to solve the issue unless they know, without a shadow of a doubt, that the source data is actually in latin-1. It'll "work" with completely random bytes, and UTF-8 bytes, and UTF-16 bytes; decoding them all as latin-1 will get you a string, but that string will be garbage. UTF-8 is self-checking and therefore any meaningful amount of data will error if it's not really UTF-8, making it a much safer choice. - Quintilla
I did add notes on using it, but rather than including it as a code sample that will be copied and pasted without thinking, I made notes on why not to use it, and when you can use it. I strongly suspect latin-1 is wrong for you even if you say it works, because on most Western European Windows systems, cp1252 (which is similar to latin-1, but not exactly the same) is the actual default locale encoding (when the data isn't stored as UTF-16, which most Windows programs use nowadays), and on basically every non-Windows system outside of East Asia (and even some in it), UTF-8 is the default. - Quintilla
Oh, hmm. Misread, it wasn't r_e_cur who proposed the edit, it was an "anonymous user". I didn't even realize that was a thing on StackOverflow. *shrugs* I'll leave these comments in place in case they ever come back to check. - Quintilla
A
-1

Ended up finding a working answer for myself:

filename=open(argfile, 'rb')

This post helped me out a lot.

Anear answered 27/1, 2016 at 17:7 Comment(1)
If you're actually using Python 3, this is going to dramatically change your behavior; opening in binary mode means not only do you not get line ending translation (admittedly only an issue on Windows), but you get back bytes objects instead of str (and must manually decode them if you want to work with str). I posted an answer that avoids this (assuming you know the encoding, which you'd need to know to perform the decode anyway). - Quintilla
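
What the comment above describes can be seen in a small sketch. The temporary file, its name, and its contents are hypothetical and written only for the demonstration, and UTF-8 is assumed as the encoding.

```python
import os
import tempfile

# Hypothetical sample file created just for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as out:
    out.write('h\u00e9llo\nworld\n'.encode('utf-8'))

# Binary mode: readlines() returns bytes objects and nothing is decoded
with open(path, 'rb') as raw:
    byte_lines = raw.readlines()

# The manual decode still requires knowing the encoding (assumed UTF-8 here)
text_lines = [line.decode('utf-8') for line in byte_lines]

print(byte_lines)
print(text_lines)
```

Binary mode only postpones the decode: the UnicodeDecodeError goes away at read time, but it comes back as soon as the bytes are converted to str with the wrong codec.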
