Pandas dataframe read_csv on bad data

I want to read in a very large CSV (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, which causes the program to crash. That row is bad data, so I need a way to ignore the fact that it has an extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I'll also possibly encounter this issue in other CSVs and want a generic solution. I couldn't find anything in read_csv, unfortunately. The code is as simple as this:

import pandas as pd

def loadCSV(filePath):
    # Load the CSV and return the dataframe along with its column labels
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys
Thrust answered 30/10, 2015 at 16:36 Comment(1)
In addition to the above, using warn_bad_lines=True may further help to diagnose the problematic rows.Duckboard

Pass error_bad_lines=False to skip erroneous rows:

error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)
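For example, applied to the question's loader (a minimal sketch; error_bad_lines is only available in older pandas versions, as noted in a later answer):

import pandas as pd

# Skip malformed rows instead of raising an exception (pre-deprecation API)
dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                        nrows=1000, error_bad_lines=False)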

Diaconicum answered 30/10, 2015 at 16:39 Comment(8)
I forgot to mention that one. It will work for what I currently want to do, but down the road I will want the row just without the incidental extra value.Thrust
An alternative is to read a single line to get the correct number of cols and then re-read again to read only those columns, e.g. cols = pd.read_csv(file, nrows=1).columns; df = pd.read_csv(file, usecols=cols). This will then ignore the additional column, I think, for that error line. Try that and let me know if it works for you.Diaconicum
@Fonti: There is no option for truncate_bad_lines. Doing so would be a bad practice. You assume that you know in advance why the data is bad (it had an extra value appended to it). But what if it has too few columns? What if the extra value was inserted rather than appended? Doing something like this is a bug magnet.Intestinal
Actually I just tried this and it won't work. I think what you'd need to do is pass error_bad_lines=False and parse the warnings to get the line numbers, then read just those lines with header=None.Diaconicum
@Edchum That's odd, I think it worked for me. Instead of getting an error at that line I got a memory error (4 million+ rows). A whole other fish to tackle.Thrust
Not sure what you mean but passing error_bad_lines=False skips those rowsDiaconicum
@StevenRumbalski I definitely agree. I'm not sure why it occasionally happens in the data as I don't produce it, but I've stumbled on some other very weird errors in the data entry. In this specific case all columns with headers at least have a 0 or null value, so the issues only occur when there are too many column entries.Thrust
Misread what you were referring to (the alternative). error_bad_lines works for this situation.Thrust

Earlier answers suggest using error_bad_lines=False and warn_bad_lines=True, but both are now deprecated in pandas (since version 1.3).

Instead, use on_bad_lines='warn' to achieve the same effect: it issues a warning when a bad line is encountered and skips that line.

dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        on_bad_lines='warn')

Other acceptable values for on_bad_lines are

  • 'error' which raises an Exception on a bad line
  • 'skip' which will skip any bad lines
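For example, a minimal sketch using 'skip' if you don't want any warnings at all (the same call as above, just with a different value):

# Silently drop malformed rows instead of warning about them
dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        on_bad_lines='skip')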
Greenock answered 26/1, 2022 at 17:25 Comment(0)

To get information about the error-causing rows, use a combination of error_bad_lines=False and warn_bad_lines=True:

dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        warn_bad_lines=True, error_bad_lines=False)

error_bad_lines=False skips error-causing rows and warn_bad_lines=True prints error details and row number, like this:

'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'

If you want to save the warning messages (e.g. for some further processing), you can redirect them to a file using contextlib:

import contextlib

import pandas as pd

# Redirect the parser's bad-line warnings from stderr into a log file
with open(r'D:\Temp\log.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                                warn_bad_lines=True, error_bad_lines=False)
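
As a follow-up, a hypothetical sketch of that further processing: the saved log can be parsed to recover the skipped line numbers (reusing the same log path as above):

import re

# Pull the skipped line numbers out of the saved warning messages
with open(r'D:\Temp\log.txt') as log:
    skipped_lines = [int(n) for n in re.findall(r'Skipping line (\d+)', log.read())]
print(skipped_lines)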
Malaya answered 22/7, 2020 at 10:18 Comment(0)

New with 1.4.0

Starting with pandas 1.4.0, read_csv() lets you handle these situations more gracefully and intelligently by allowing a callable to be assigned to on_bad_lines=.

For example, assume a CSV that could cause a bad data error: Expected 4 fields in line 3, saw 5:

C1,C2,C3,C4
10,11,12,13
25,26,27,28,garbage
80,81,82,83

The following lambda function simply ignores the last column in the bad line (as was desired in the original problem statement above):

df = pd.read_csv('your.csv', on_bad_lines=lambda x: x[:-1], engine='python')
df

   C1  C2  C3  C4
0  10  11  12  13
1  25  26  27  28
2  80  81  82  83

The on_bad_lines callable is called on each bad line and has the signature (bad_line: list[str]) -> list[str] | None. If the function returns None, the bad line is ignored. As you can see, engine='python' is required.

The great thing about this is that it opens the door big-time for whatever fine-grained logic you want to code to fix the problem.

For example, say you'd like to remove bad data from the start or the end of the line, and simply ignore the line if there is bad data at both the start and the end. You could do the following:

CSV

C1,C2,C3,C4
10,11,12,13
20,21,22,23,garbage
60,61,62,63
trash,80,81,82,83
trash,90,91,82,garbage

Function Definition

def line_fixer(x):
    # Extra value at the start only: drop the first field
    if not x[0].isnumeric() and x[-1].isnumeric():
        return x[1:]

    # Extra value at the end only: drop the last field
    if not x[-1].isnumeric() and x[0].isnumeric():
        return x[:-1]

    # Bad data at both ends: skip the line entirely
    return None

Result

df = pd.read_csv('your.csv', on_bad_lines=line_fixer, engine='python')
df

   C1  C2  C3  C4
0  10  11  12  13
1  20  21  22  23
2  60  61  62  63
3  80  81  82  83
Carmel answered 28/4, 2022 at 22:53 Comment(3)
Does it skip over rows that have fewer values than required? I.e., would it skip over a row that has 4 values when all other rows have 5 values each?Foot
@VeerajaVeeraesh read_csv does not have the capability to allow you to skip short rows.Carmel
Maybe an old post, but is there any way to store these lines? I have tried something like def bad_lines_collect(bad_line): badlines_list.append(bad_line) but it's not working.Builtup

Here is my way to solve this problem. It is slow, but it works well: simply read the CSV file as a text file and go through each line. If a line has fewer commas than it should, skip that row, and finally save the correct lines.

def bad_lines(path):
    import itertools
    num_columns = []
    lines = ""

    # Sample the start of the file to work out how many commas a
    # well-formed row should contain (lines containing quotes are skipped)
    for i in range(10, 50, 5):
        content = open(path).readlines(i)[0]
        if content.count("'") == 0 and content.count('"') == 0:
            num_columns.append(content.count(","))

    # If the sampled lines agree on the comma count, keep only the lines
    # that have at least that many commas
    if len(set(num_columns)) == 1:
        for line in itertools.islice(open(path), 0, None):
            if line.count(",") >= num_columns[0]:
                lines = lines + line

    # Write the surviving lines to a temporary file and return its path
    text_file = open("temp.txt", "w")
    text_file.write(lines)
    text_file.close()

    return "temp.txt"
Fraternal answered 20/9, 2020 at 3:56 Comment(1)
There are MUCH better ways of handling this as your answer is VERY error prone. Please consider revising or removing this answer.Shelton
