New with 1.4.0
Starting with pandas
1.4.0
, read_csv()
delivers capability that allows you to handle these situations in a more graceful and intelligent fashion by allowing a callable to be assigned to on_bad_lines=
.
For example, assume a CSV
that could cause a bad data error: Expected 4 fields in line 3, saw 5
:
C1,C2,C3,C4
10,11,12,13
25,26,27,28,garbage
80,81,82,83
The following lambda function simply ignores the last column in the bad line (as was desired in the original problem statement above):
df = pd.read_csv('your.csv', on_bad_lines=lambda x: x[:-1], engine='python')
df
C1 C2 C3 C4
0 10 11 12 13
1 25 26 27 28
2 80 81 82 83
The on_bad_lines
callable function is called on each bad line and has a function signature (bad_line: list[str]) -> list[str] | None
. If the function returns None
, the bad line will be ignored. As you can see engine='python'
is required.
The great thing about this is that it opens the door big-time for whatever fine-grained logic you want to code to fix the problem.
For example, say you'd like to remove bad data from the start or the end of the line and simply ignore the line if there is bad data in both the start and the end you could:
CSV
C1,C2,C3,C4
10,11,12,13
20,21,22,23,garbage
60,61,62,63
trash,80,81,82,83
trash,90,91,82,garbage
Function Definition
def line_fixer(x):
if not x[0].isnumeric() and x[-1].isnumeric():
return x[1:]
if not x[-1].isnumeric() and x[0].isnumeric():
return x[:-1]
return None
Result
df = pd.read_csv('your.csv', on_bad_lines=line_fixer, engine='python')
df
C1 C2 C3 C4
0 10 11 12 13
1 20 21 22 23
2 60 61 62 63
3 80 81 82 83
warn_bad_lines=True
may further help to diagnose the problematic rows. – Duckboard