ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10 (error reading CSV file)
I am trying to read a CSV file using pandas:

df1 = pd.read_csv('panda_error.csv', header=None, sep=',')

But I am getting this error:

ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10

For reproducibility, here is the CSV file panda_error.csv:

superkingdom:Bacteria , phylum:Actinobacteria , class:Actinobacteria , order:Corynebacteriales , family:Corynebacteriaceae , genus:Corynebacterium , species:Corynebacterium efficiens  1
superkingdom:Bacteria , phylum:Proteobacteria , class:Alphaproteobacteria , order:Rhizobiales , family:Aurantimonadaceae , genus:Aurantimonas , species:Aurantimonas manganoxydans  1
superkingdom:Bacteria , phylum:Proteobacteria , subphylum:delta/epsilon subdivisions , class:Deltaproteobacteria , no rank:unclassified Deltaproteobacteria , genus:Candidatus Entotheonella    1
superkingdom:Bacteria , phylum:Proteobacteria , class:Gammaproteobacteria , order:Pseudomonadales , family:Pseudomonadaceae , genus:Pseudomonas , species group:Pseudomonas syringae group , species subgroup:Pseudomonas syringae group genomosp. 2 , species:Pseudomonas amygdali , no rank:Pseudomonas amygdali pv. tabaci   1
superkingdom:Bacteria , phylum:Actinobacteria , class:Actinobacteria , order:Corynebacteriales , family:Nocardiaceae , genus:Rhodococcus , species:Rhodococcus wratislaviensis  1
superkingdom:Bacteria , phylum:Firmicutes , class:Clostridia , order:Clostridiales , family:Peptostreptococcaceae , genus:Peptoclostridium , species:Peptoclostridium difficile1

I am not really sure why this is happening or how to resolve it. The other answers I have found either 1. suggest ignoring the troubling lines with error_bad_lines=False, which I do not want to do, or 2. are specific to other scenarios.
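A quick way to confirm where the row lengths diverge (a diagnostic sketch; the inline sample stands in for panda_error.csv) is to count the comma-separated fields per line before handing the file to pandas:

```python
# Diagnostic sketch: count the comma-separated fields in each line.
# The inline sample mimics the shape of panda_error.csv (7-field rows
# followed by a 10-field row).
sample_lines = [
    "a,b,c,d,e,f,g",        # 7 fields, like the first rows
    "a,b,c,d,e,f,g",        # 7 fields
    "a,b,c,d,e,f",          # 6 fields
    "a,b,c,d,e,f,g,h,i,j",  # 10 fields, like the row that fails
]
field_counts = [len(line.split(",")) for line in sample_lines]
print(field_counts)  # [7, 7, 6, 10]
```

pandas infers the column count from the first rows (here 7), so the first row with more fields than that triggers the ParserError.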

Here is the full error message if this helps:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-34-72c0ecaf0513> in <module>
----> 1 df1 = pd.read_csv('panda_error.csv', header=None, sep=',')

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    461 
    462     try:
--> 463         data = parser.read(nrows)
    464     finally:
    465         parser.close()

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1152     def read(self, nrows=None):
   1153         nrows = _validate_integer("nrows", nrows)
-> 1154         ret = self._engine.read(nrows)
   1155 
   1156         # May alter columns / col_dict

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2057     def read(self, nrows=None):
   2058         try:
-> 2059             data = self._reader.read(nrows)
   2060         except StopIteration:
   2061             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 7 fields in line 4, saw 10
Weevil answered 20/12, 2019 at 4:42 Comment(2)
The error occurs because the parser sees 7 fields in the first lines but 10 fields at line 4. Check this link on how to overcome the issue: #27020716 – Tubbs
All the lines in a CSV file need to have the same number of fields, because Pandas dataframes have to be uniform. – Indican

Pandas is a tool for processing tabular data, which means each row should contain the same number of fields. CSV input adds one more requirement: the fields in each row should appear in the same order.

Your input file fails to meet both of these requirements.

The first two rows (and probably most others) have 7 fields: superkingdom, phylum, class, order, family, genus and species.

The third row, however, contains superkingdom, phylum, subphylum, class, no rank and genus. So:

  • it has additional fields (subphylum and no rank),
  • it lacks the order, family and species fields.

This row does not make read_csv fail, but only because its field count (6 in total) does not exceed that of the previous rows.

The real problem is row 4, which has 10 fields.

So an "ordinary" read_csv is not a good choice here. Even if you allowed enough columns to read all rows, the attributes would be scattered among the columns in a way that is difficult to read.

Any attempt to analyze such data by column name would also fail, because each column would hold different information in different rows.

Yet another problem is that each comma-separated field is itself compound, e.g. superkingdom:Bacteria, i.e.:

  • the text which should rather be a column (attribute) name,
  • a colon,
  • the actual value.

To overcome these problems, try a different approach to reading the input file:

  1. Read the input file with read_csv, but as a single column (set sep to a character that does not occur in the data).

    df = pd.read_csv('input.csv', sep='|', names=['col1'])

  2. Then extract the name/value pairs into a DataFrame that can be analyzed programmatically, using extractall (import re is needed):

    df2 = df.col1.str.extractall(
        r'(?P<name>[A-Z ]+[A-Z]):(?P<value>[A-Z /]+[A-Z])', flags=re.I)\
        .reset_index(level=1, drop=True)
    

If you are not fluent in regular expressions, read a little about them.
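To illustrate what the pattern captures, here it is applied to a single sample row (same regex as in step 2; the one-row DataFrame is a stand-in for df):

```python
import re

import pandas as pd

# One sample row, as the single-column DataFrame produced by step 1.
df = pd.DataFrame({"col1": [
    "superkingdom:Bacteria , phylum:Actinobacteria , genus:Corynebacterium  1"
]})

# Same pattern as in step 2: every "name:value" pair becomes one row.
df2 = df.col1.str.extractall(
    r'(?P<name>[A-Z ]+[A-Z]):(?P<value>[A-Z /]+[A-Z])', flags=re.I)\
    .reset_index(level=1, drop=True)

print(df2)
```

Note that the trailing count (1) is dropped, because digits are not part of either character class.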

The result is a DataFrame with 2 columns:

  • name - the attribute name, e.g. superkingdom,
  • value - the attribute value, e.g. Bacteria.

The index is the same as in df - it is the source row number, starting from 0.

For your sample data the result is as follows:

                name                                value
0       superkingdom                             Bacteria
0             phylum                       Actinobacteria
0              class                       Actinobacteria
0              order                    Corynebacteriales
0             family                   Corynebacteriaceae
0              genus                      Corynebacterium
0            species            Corynebacterium efficiens
1       superkingdom                             Bacteria
1             phylum                       Proteobacteria
1              class                  Alphaproteobacteria
1              order                          Rhizobiales
1             family                    Aurantimonadaceae
1              genus                         Aurantimonas
1            species           Aurantimonas manganoxydans
2       superkingdom                             Bacteria
2             phylum                       Proteobacteria
2          subphylum           delta/epsilon subdivisions
2              class                  Deltaproteobacteria
2            no rank     unclassified Deltaproteobacteria
2              genus             Candidatus Entotheonella
3       superkingdom                             Bacteria
3             phylum                       Proteobacteria
3              class                  Gammaproteobacteria
3              order                      Pseudomonadales
3             family                     Pseudomonadaceae
3              genus                          Pseudomonas
3      species group           Pseudomonas syringae group
3   species subgroup  Pseudomonas syringae group genomosp
3            species                 Pseudomonas amygdali
3            no rank              Pseudomonas amygdali pv
4       superkingdom                             Bacteria
4             phylum                       Actinobacteria
4              class                       Actinobacteria
4              order                    Corynebacteriales
4             family                         Nocardiaceae
4              genus                          Rhodococcus
4            species          Rhodococcus wratislaviensis
5       superkingdom                             Bacteria
5             phylum                           Firmicutes
5              class                           Clostridia
5              order                        Clostridiales
5             family                Peptostreptococcaceae
5              genus                     Peptoclostridium
5            species           Peptoclostridium difficile

If you want these data as a table with each name converted to a respective column, run:

df3 = df2.set_index('name', append=True).unstack(fill_value='')
df3.columns = df3.columns.droplevel()
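On a tiny stand-in for df2 (hypothetical data, same two operations), the reshaping looks like this:

```python
import pandas as pd

# Long-format stand-in for df2: the index is the source row number.
df2 = pd.DataFrame(
    {"name": ["superkingdom", "phylum", "superkingdom"],
     "value": ["Bacteria", "Firmicutes", "Bacteria"]},
    index=[0, 0, 1],
)

# Same two lines as above: pivot names into columns, '' where missing.
df3 = df2.set_index("name", append=True).unstack(fill_value="")
df3.columns = df3.columns.droplevel()

print(df3)
```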

Take a look at the result - I think it will be more readable than any other approach.

Salgado answered 20/12, 2019 at 17:23 Comment(1)
I think the alternative df2 = df.col1.str.split(';', expand=True) for step 2 would be faster and simplify the code significantly. (The separator ';' was chosen arbitrarily; change it to the actual separator character of your text file.) – Crank

This solution worked for me:

### Loop the data lines
with open("panda_error.csv", 'r') as temp_f:
    # get No of columns in each line
    col_count = [ len(l.split(",")) for l in temp_f.readlines() ]

### Generate column names  (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv
df = pd.read_csv("panda_error.csv", header=None, delimiter=",", names=column_names)
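A related note: since pandas 1.3 the error_bad_lines flag has been replaced by on_bad_lines, and the pre-scan above can be skipped whenever an upper bound on the column count is already known, because extra names simply pad short rows with NaN. A minimal sketch (the inline sample stands in for panda_error.csv):

```python
import io

import pandas as pd

# Inline sample standing in for panda_error.csv: a 7-field row
# followed by a 10-field row.
data = "a,b,c,d,e,f,g\n1,2,3,4,5,6,7,8,9,10\n"

# Supplying enough column names up front lets the parser accept the
# ragged rows; short rows are padded with NaN on the right.
df = pd.read_csv(io.StringIO(data), header=None, names=list(range(10)))
print(df.shape)  # (2, 10)
```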
Weevil answered 20/12, 2019 at 6:25 Comment(0)

To get the column names as the actual names (not numbers), I improved the code as follows (use the delimiter appropriate for your case; mine was '\t'):

### Loop the data lines
with open("file.csv", 'r') as temp_f:
    # get No of columns in each line
    col_count = [ len(l.split("\t")) for l in temp_f.readlines() ]

### Generate column names  (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv, then re-read it with the first row as the column names
df = pd.read_csv("file.csv", header=None, delimiter="\t", names=column_names)
col_names = list(df.iloc[0])
df = pd.read_csv("file.csv", header=None, delimiter="\t", names=col_names)
df.drop(0, inplace=True); df.reset_index(drop=True, inplace=True)
df
Gabriellegabrielli answered 24/7, 2021 at 19:45 Comment(0)
