UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?
I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non-ASCII characters, such as Ö. I need to read the lines using Python, and I can afford to skip any line that contains a non-ASCII character.

My problem is that when I read the file in Python, I get the UnicodeDecodeError when reaching the line where a non-ascii character exists, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

fileHandle = codecs.open("test.csv", encoding="utf-8")
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is raised the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and continue. I would rather not make any changes to the input file, if possible.

Is there any way to do this? Thank you very much.

Drunkard answered 7/7, 2014 at 17:49 Comment(6)
Why are you using codecs.open() in Python 3? open() handles UTF-8 just fine.Vapor
I also tried using open, I get the same errorDrunkard
Do you know what encoding the file is really using? It's clearly not us-ascii as shown by the file output, since it contains non-ascii characters.Goff
@Chicoscience: I wasn't addressing your problem; I was puzzled as to why you were using codecs.open() here, as it is inferior to open().Vapor
Not a problem, Martijn, thanks! Dano, that is strange to me as well, the encoding says ascii but it is clearly not asciiDrunkard
See also: set the implicit default encoding/decoding error handling in PythonLai

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.
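As an aside (not part of the original answer): since the sample contains Ö, the file may well actually be Latin-1 or cp1252 rather than ASCII or UTF-8. If that guess is right, picking the matching codec avoids decode errors altogether. A minimal sketch:

```python
raw = b"K\xd6LN"  # 0xD6 is "Ö" in Latin-1 / cp1252, but invalid as a lone byte in UTF-8
print(raw.decode("latin-1"))  # → KÖLN
```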

You can tell open() how to treat decoding errors, with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
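For example (illustrative only; note that on decoding, 'replace' inserts U+FFFD, the Unicode replacement character, rather than '?'):

```python
raw = b"caf\xe9"  # 0xE9 is "é" in Latin-1, not valid UTF-8
print(raw.decode("utf-8", errors="ignore"))   # → caf
print(raw.decode("utf-8", errors="replace"))  # → caf� (U+FFFD)
```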

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

E.g.

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 cannot cope with dropped or extra bytes, which then affects how accurately line separators can be located. The approach above can then treat the remainder of the file as one long 'line', and for a big file that single line can be large enough to trigger a MemoryError exception.
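To skip offending lines entirely, as the question asks, the same surrogate-range test can serve as a filter. A sketch (not from the original answer) using an in-memory stream as a stand-in for the CSV file:

```python
import io
import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

raw = b"good line\nbad \xd6 line\nanother good line\n"
f = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8", errors="surrogateescape")

# Keep only lines that contain no escaped (undecodable) bytes
kept = [line for line in f if not _surrogates.search(line)]
print(kept)  # → ['good line\n', 'another good line\n']
```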

Vapor answered 7/7, 2014 at 18:14 Comment(14)
I tried to find an alternate solution by catching the decoding exceptions themselves. Unfortunately it appears (in Python 2 at least) that the decoding occurs before line endings are detected, so you don't get consistent results - you might lose more than one line, or you might get hung on the same buffer forever.Caulis
@MartijnPieters The issue with ignore is that it will ignore invalid characters and not the whole line...so, I'd like to use strict and catch the Exception, to do finer-grained error handling. But like OP, I can't figure out how to do this with the for loop...Inheritor
@Inheritor You can’t because decoding is done per block of file data, not per line. There is a work-around: use an error handler that replaces erroneous characters then look for the replacements in each line read. If you use surrogateescape as the error handler you can even recover the problematic bytes. I’ve added example code to the answer.Vapor
@MarkRansom same idea for you, albeit 5 years late.Vapor
@MartijnPieters Aha! This does it. But I don't understand one thing: if Python doesn't decode by line, then how do ignore and surrogateescape work? Don't they furnish one line at a time?Inheritor
@Inheritor No, they operate on the same block when decoding. An encoding has no special knowledge of line delimiters. Multi-byte codecs (UTF-16 & UTF-32 specifically) encode newline characters using more than one byte, which means you can’t split text into lines without decoding first. I am not sure where the confusion lies here?Vapor
@MartijnPieters I understand what you're saying about the decoding - a block must be decoded to find the newlines first. What I don't see, is why Python doesn't provide an API which lets the dev catch the DecodeError by line? It already does this split by newline delimiter for ignore and surrogateescape, so why not do it for strict error handling, too?Inheritor
To be more concrete, take this snippet: gist.github.com/flow2k/8bd4fece21fa1a0b75737a3d9fc2e86c. I'm using readline() here to try to catch the exception by line. But I found it doesn't work: when the exception is thrown, the rest of the block is skipped and next readline() returns the next block. Python could have set the seek position to the previous newline delimiter in the original block so there is no skipping, but somehow it wasn't designed this way.Inheritor
@flow2k: newline detection is a completely separate task done after decoding. There is no special handling in the error handlers for this. Either decoding the block succeeds or it fails, and if it succeeds a later stage can detect line separators and produce individual string objects for each line. All that a different error handler does is influence how bad data is handled when decoding.Vapor
@flow2k: also, it is entirely possible to corrupt the remainder of your input stream by dropping or inserting invalid bytes. That means it is impossible to know anything about line separator characters and so about lines, to attribute errors to.Vapor
@Inheritor last but not least: file data and other streams have no concept of lines, only of a sequence of bytes. Only when you interpret those bytes (assign meaning to them via a codec) can you start to designate some of those bytes (or a specific sequence of bytes) as a line separator, and everything between the line separators as lines. That all means that without decoding, there are no lines. If decoding fails, you can’t say, with 100% accuracy for all inputs, what line an error applies to.Vapor
@Inheritor what the surrogateescape approach gives you is that you say to the decoder: please soldier on, give me the bad data wrapped up in special codepoints, and hope for the best. We’ll just trust that what comes after isn’t too badly corrupted and we can pretend that line separators are still line separators.Vapor
@MartijnPieters What you are saying makes sense. But why can't we also say to the strict decoder: you see bad data, okay, but please soldier on until you see what appears to be a line separator, and then set the seek position there. After you've done that, throw an Exception. I wanted you to set the seek position because the next call to readline() can start immediately after the line separator.Inheritor
@flow2k: this is going round in circles now. No, you can't say that to a decoder because decoders have no knowledge of line separators. The error happens in a block of bytes, it can be before a line separator or after. There can be many line separators or zero. The decoder should not care nor can it. You can't ask a decoder to continue to a next line separator because there might not be a next line. Decoders are engineered to also work on streaming data (say, from a network connection), and so don't know how much data is still to follow, or when it'll be available.Vapor
