Your default encoding appears to be ASCII, while the input is more than likely UTF-8. When the read hits non-ASCII bytes in the input, it throws the exception. It's not so much that `readlines` itself is responsible for the problem; rather, it triggers the read+decode, and the decode is what fails.
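You can reproduce the failure in isolation; decoding UTF-8 bytes with the ASCII codec raises the same exception (the byte string here is just an example):

```python
>>> b'caf\xc3\xa9'.decode('ascii')  # 'café' encoded as UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
```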
It's an easy fix though: the built-in `open` in Python 3 lets you provide the known encoding of an input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading as `str` (rather than the significantly different raw binary `bytes` objects), while letting Python do the work of converting from raw disk bytes to true text data:
```python
# Using a with statement closes the file for us without needing to remember
# to close it explicitly, and closes it even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
```
If the file is in some other encoding, you'd change `encoding='utf-8'` to the appropriate argument. Note that while some people will tell you to "just use `'latin-1'`" here if `'utf-8'` doesn't work:
- That's often wrong (modern text editors tend to produce UTF-8 or UTF-16, with latin-1 being much less common; frankly, you're more likely to see Microsoft's `'latin-1'` variant, `'cp1252'`, which is mostly the same but remaps some characters to support stuff like smart quotes), and
- Unlike the UTF encodings, the various byte-per-character ASCII superset encodings (including `'latin-1'`, `'cp1252'`, `'cp437'`, and many others) are not self-checking; if the data isn't in the encoding specified, they'll still happily decode it, just producing gibberish for anything above the ASCII range (see the demo after this list).
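To make the self-checking point concrete, here's a minimal demonstration (the sample strings are arbitrary): UTF-8 refuses bytes it can't make sense of, while `'latin-1'` accepts anything and silently produces mojibake:

```python
data = 'café'.encode('utf-8')  # b'caf\xc3\xa9'

print(data.decode('utf-8'))    # café - correct
print(data.decode('latin-1'))  # cafÃ© - "succeeds", but it's gibberish

# In the other direction, cp1252 smart quotes aren't valid UTF-8, and the
# UTF-8 codec catches the mismatch instead of producing garbage:
quoted = '“hi”'.encode('cp1252')  # b'\x93hi\x94'
try:
    quoted.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 rejected the cp1252 bytes')
print(quoted.decode('latin-1'))   # "succeeds" again, yielding control characters
```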
In short, if your data isn't in a UTF encoding (or one of the rare non-UTF self-checking encodings), you need to know the encoding used, or you're stuck guessing and checking the result to see if it makes sense (and for something like a source that might be latin-1 or cp1252, you'll never be sure unless it eventually contains a cp1252-specific character).
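If you do end up guessing, the loop is simple enough to sketch (the candidate list here is an assumption, and `argfile` stands in for the question's variable); just remember that a clean decode from `'latin-1'` or `'cp1252'` proves nothing on its own, so put the self-checking encodings first and eyeball whatever result you accept:

```python
argfile = 'input.txt'  # hypothetical path, standing in for the question's variable

# Self-checking encodings first; 'latin-1' last because it never fails
for enc in ('utf-8', 'utf-16', 'cp1252', 'latin-1'):
    try:
        with open(argfile, encoding=enc) as inf:
            f = inf.readlines()
    except UnicodeDecodeError:
        continue  # the codec rejected the data; try the next candidate
    print(f'Decoded with {enc!r}; sanity-check the result yourself')
    break
else:
    print('None of the candidate encodings worked')
```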