Pandas read_csv expects wrong number of columns, with ragged csv file

Asked 22/11, 2013 at 20:54 Answered 19/1, 2017 at 5:59

I have a csv file that has a few hundred rows and 26 columns, but the last few columns only have a value in a few rows and they are towards the middle or end of the file. When I try to read it in using read_csv() I get the following error. "ValueError: Expecting 23 columns, got 26 in row 64"

I can't see where to explicitly state the number of columns in the file, or how it determines how many columns it thinks the file should have. The dump is below

In [3]:

infile =open(easygui.fileopenbox(),"r")
pledge = read_csv(infile,parse_dates='true')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-b35e7a16b389> in <module>()
      1 infile =open(easygui.fileopenbox(),"r")
      2 
----> 3 pledge = read_csv(infile,parse_dates='true')


C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
    234         kwds['delimiter'] = sep
    235 
--> 236     return _read(TextParser, filepath_or_buffer, kwds)
    237 
    238 @Appender(_read_table_doc)

C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in _read(cls, filepath_or_buffer, kwds)
    189         return parser
    190 
--> 191     return parser.get_chunk()
    192 
    193 @Appender(_read_csv_doc)

C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in get_chunk(self, rows)
    779             msg = ('Expecting %d columns, got %d in row %d' %
    780                    (col_len, zip_len, row_num))
--> 781             raise ValueError(msg)
    782 
    783         data = dict((k, v) for k, v in izip(self.columns, zipped_content))

ValueError: Expecting 23 columns, got 26 in row 64

Mckelvey answered 22/11, 2013 at 20:54 Comment(4)

Thank you Roman, I was in the middle of figuring out how to make it more readable myself and you beat me to it. :) – Mckelvey 22/11, 2013 at 20:57

np, about your question, do you have header in your file? – Marashio 22/11, 2013 at 20:58

No, no header, looks like that or the answer below would be the way to go. – Mckelvey 22/11, 2013 at 21:13

Maybe of interest: error_bad_lines=False let you skip the lines with too much columns #33441305 – Edmundson 30/11, 2021 at 14:38

You can use names parameter. For example, if you have csv file like this:

1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6

And try to read it, you'll receive and error

>>> pd.read_csv(r'D:/Temp/tt.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6

But if you pass names parameters, you'll get result:

>>> pd.read_csv(r'D:/Temp/tt.csv', names=list('abcdef'))
   a  b  c   d   e   f
0  1  2  1 NaN NaN NaN
1  2  3  4   2   3 NaN
2  1  2  3   3 NaN NaN
3  1  2  3   4   5   6

Hope it helps.

Marashio answered 22/11, 2013 at 21:2 Comment(1)

What can I do if I have the names in the first column only with a-c, how can I help to give dummy names? – Wrought 23/9, 2019 at 13:45

you can also load the CSV with separator '^', to load the entire string to a column, then use split to break the string into required delimiters. After that, you do a concat to merge with the original dataframe (if needed).

temp=pd.read_csv('test.csv',sep='^',header=None,prefix='X')
temp2=temp.X0.str.split(',',expand=True)
del temp['X0']
temp=pd.concat([temp,temp2],axis=1)

Dualistic answered 24/8, 2015 at 22:10 Comment(0)

Suppose you have a file like this:

a,b,c
1,2,3
1,2,3,4

You could use csv.reader to clean the file first,

lines=list(csv.reader(open('file.csv')))    
header, values = lines[0], lines[1:]    
data = {h:v for h,v in zip (header, zip(*values))}

and get:

{'a' : ('1','1'), 'b': ('2','2'), 'c': ('3', '3')}

If you don't have header you could use:

data = {h:v for h,v in zip (str(xrange(number_of_columns)), zip(*values))}

and then you can convert dictionary to dataframe with

import pandas as pd
df = pd.DataFrame.from_dict(data)

Sarabia answered 22/11, 2013 at 21:13 Comment(1)

if you have the header row you can use csv.DictReader – Verbify 17/5, 2016 at 11:6

The problem with the given solution is that you have to know the max number of columns required. I couldn't find a direct function for this problem, but you can surely write a def which can:

read all the lines
split it
count the number of words/elements in each row
store the max number of words/elements
place that max value in the names option (as suggested by Roman Pekar)

Here is the def (function) I wrote for my files:

def ragged_csv(filename):
    f=open(filename)
    max_n=0
    for line in f.readlines():
        words = len(line.split(' '))
        if words > max_n:
            max_n=words
    lines=pd.read_csv(filename,sep=' ',names=range(max_n))
    return lines

Glynas answered 19/1, 2017 at 5:59 Comment(1)

In my particular case, I did know the max number of columns, but this may be useful if you are using external data where that's not immediately available. – Mckelvey 19/1, 2017 at 10:21

Recommended topics

Hot tags