Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.

You can tell open() how to treat decoding errors with the errors keyword:
errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:
- 'strict' raises a ValueError exception if there is an encoding error. The default value of None has the same effect.

- 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.

- 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.

- 'surrogateescape' represents any incorrect bytes as low surrogate code points ranging from U+DC80 to U+DCFF. These code points are turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.

- 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.

- 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python's backslashed escape sequences.
Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for code points in the surrogate range:
import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text.

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples.

    """
    # DC80 - DCFF encode bad bytes 80 - FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]
E.g.:

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")
Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line, and if the file is big enough, reading that 'line' can in turn raise a MemoryError exception.
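If that's a concern, you can cap how much of a 'line' is read at a time with the size argument to readline(). A minimal defensive sketch, where the MAX_CHARS limit is an illustrative assumption:

MAX_CHARS = 1 << 20  # at most ~1 million characters per read; tune to taste

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    while True:
        chunk = f.readline(MAX_CHARS)  # stops at '\n' or after MAX_CHARS characters
        if not chunk:
            break  # end of file
        if len(chunk) == MAX_CHARS and not chunk.endswith("\n"):
            # No newline within the cap: likely a runaway "line" caused
            # by decoding damage; bail out instead of exhausting memory.
            raise ValueError("line too long; file may be badly corrupted")
        # ... process chunk as a line ...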