Pandas read_csv expects wrong number of columns, with ragged csv file
Asked Answered
M

4

33

I have a csv file that has a few hundred rows and 26 columns, but the last few columns only have a value in a few rows and they are towards the middle or end of the file. When I try to read it in using read_csv() I get the following error. "ValueError: Expecting 23 columns, got 26 in row 64"

I can't see where to explicitly state the number of columns in the file, or how it determines how many columns it thinks the file should have. The dump is below

In [3]:

infile =open(easygui.fileopenbox(),"r")
pledge = read_csv(infile,parse_dates='true')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-b35e7a16b389> in <module>()
      1 infile =open(easygui.fileopenbox(),"r")
      2 
----> 3 pledge = read_csv(infile,parse_dates='true')


C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
    234         kwds['delimiter'] = sep
    235 
--> 236     return _read(TextParser, filepath_or_buffer, kwds)
    237 
    238 @Appender(_read_table_doc)

C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in _read(cls, filepath_or_buffer, kwds)
    189         return parser
    190 
--> 191     return parser.get_chunk()
    192 
    193 @Appender(_read_csv_doc)

C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in get_chunk(self, rows)
    779             msg = ('Expecting %d columns, got %d in row %d' %
    780                    (col_len, zip_len, row_num))
--> 781             raise ValueError(msg)
    782 
    783         data = dict((k, v) for k, v in izip(self.columns, zipped_content))

ValueError: Expecting 23 columns, got 26 in row 64
Mckelvey answered 22/11, 2013 at 20:54 Comment(4)
Thank you Roman, I was in the middle of figuring out how to make it more readable myself and you beat me to it. :)Mckelvey
np, about your question, do you have header in your file?Marashio
No, no header, looks like that or the answer below would be the way to go.Mckelvey
Maybe of interest: error_bad_lines=False let you skip the lines with too much columns #33441305Edmundson
M
40

You can use names parameter. For example, if you have csv file like this:

1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6

And try to read it, you'll receive and error

>>> pd.read_csv(r'D:/Temp/tt.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6

But if you pass names parameters, you'll get result:

>>> pd.read_csv(r'D:/Temp/tt.csv', names=list('abcdef'))
   a  b  c   d   e   f
0  1  2  1 NaN NaN NaN
1  2  3  4   2   3 NaN
2  1  2  3   3 NaN NaN
3  1  2  3   4   5   6

Hope it helps.

Marashio answered 22/11, 2013 at 21:2 Comment(1)
What can I do if I have the names in the first column only with a-c, how can I help to give dummy names?Wrought
D
9

you can also load the CSV with separator '^', to load the entire string to a column, then use split to break the string into required delimiters. After that, you do a concat to merge with the original dataframe (if needed).

temp=pd.read_csv('test.csv',sep='^',header=None,prefix='X')
temp2=temp.X0.str.split(',',expand=True)
del temp['X0']
temp=pd.concat([temp,temp2],axis=1)
Dualistic answered 24/8, 2015 at 22:10 Comment(0)
S
4

Suppose you have a file like this:

a,b,c
1,2,3
1,2,3,4

You could use csv.reader to clean the file first,

lines=list(csv.reader(open('file.csv')))    
header, values = lines[0], lines[1:]    
data = {h:v for h,v in zip (header, zip(*values))}

and get:

{'a' : ('1','1'), 'b': ('2','2'), 'c': ('3', '3')}

If you don't have header you could use:

data = {h:v for h,v in zip (str(xrange(number_of_columns)), zip(*values))}

and then you can convert dictionary to dataframe with

import pandas as pd
df = pd.DataFrame.from_dict(data)
Sarabia answered 22/11, 2013 at 21:13 Comment(1)
if you have the header row you can use csv.DictReaderVerbify
G
1

The problem with the given solution is that you have to know the max number of columns required. I couldn't find a direct function for this problem, but you can surely write a def which can:

  1. read all the lines
  2. split it
  3. count the number of words/elements in each row
  4. store the max number of words/elements
  5. place that max value in the names option (as suggested by Roman Pekar)

Here is the def (function) I wrote for my files:

def ragged_csv(filename):
    f=open(filename)
    max_n=0
    for line in f.readlines():
        words = len(line.split(' '))
        if words > max_n:
            max_n=words
    lines=pd.read_csv(filename,sep=' ',names=range(max_n))
    return lines
Glynas answered 19/1, 2017 at 5:59 Comment(1)
In my particular case, I did know the max number of columns, but this may be useful if you are using external data where that's not immediately available.Mckelvey

© 2022 - 2024 — McMap. All rights reserved.