Error reading a CSV file in pandas [CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.]
So I tried reading all the CSV files from a folder, concatenating them into one big CSV (the structure of all the files was the same), saving it, and reading it back in. All of this was done using pandas. The error occurs while reading. I am attaching the code and the error below.

import pandas as pd
import numpy as np
import glob

path = r'somePath' # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv("C:\work\DATA\Raw_data\\store.csv", sep=',', index= False)
store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

Error:

CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I tried using the csv module's reader as well:

import csv
with open("C:\work\DATA\Raw_data\\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)

Error:

Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open('C:\work\DATA\Raw_data\\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Hardiness answered 30/11, 2015 at 12:30

I ran into this error too; the cause was that there were some carriage returns ("\r") in the data that pandas was treating as line terminators, as if they were "\n". I thought I'd post here, as that might be a common reason this error comes up.

The solution I found was to add lineterminator='\n' into the read_csv function like this:

df_clean = pd.read_csv('test_error.csv', lineterminator='\n')
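As a quick illustration of the effect (a minimal sketch; the file contents here are invented): with lineterminator='\n', the C parser splits rows only on "\n", so a stray "\r" inside a field stays in the data instead of starting a new row.

```python
import io

import pandas as pd

# A stray carriage return inside a field, not a real row break.
data = "a,b\nx\rz,1\n"

# With lineterminator='\n', rows are split only on '\n', so the
# '\r' is kept inside the first field rather than ending the row.
df = pd.read_csv(io.StringIO(data), lineterminator='\n')
print(df.shape)  # one data row, two columns
```

Without the lineterminator argument, the same "\r" may be treated as a row break, which can produce ragged rows and the tokenizing error above.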
Paraffinic answered 10/1, 2018 at 12:4
Very apt; not easy to figure this out from the error message alone. – Attired
Oh my god, I was going crazy for hours and tried numerous different delimiters and nothing would work. Thank you so much! – Chirurgeon
I was writing the csv files and later reading them back in, and I got this error (running on Linux; this did not happen on Windows). Adding lineterminator='\n' alone was not sufficient; I also had to add encoding='utf-8-sig' to the to_csv call when writing the csv files. – Thermomagnetic
I used to work with a column of freely-typed text input. I knew this should be obvious; I'd thought of it before. Yet it never occurred to me when I actually encountered such errors. (+1) – Decasyllable
You can also find/replace [ctrl]+[enter] with '\n' in VS Code if you have an existing csv you want to use. – Nauseate
I wish I could upvote this 100 times. – Amado

If it is a big file, you can use engine='python' as below, and it should work:

df = pd.read_csv( file_, index_col=None, header=0, engine='python' )

Far answered 11/12, 2018 at 19:58
In the future, please use Markdown to format your posts and responses. – Rustyrut
At the time of writing, engine='python' breaks down when it encounters a NULL byte, and the error message encourages you to use engine='c'. – Decasyllable

Not an answer, but too long for a comment (to say nothing of code formatting).

Since it also breaks when you read it with the csv module, you can at least locate the line where the error occurs:

import csv
with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    linenumber = 1
    try:
        for row in reader:
            linenumber += 1
    except Exception as e:
        print("Error line %d: %s %s" % (linenumber, type(e), e))

Then look in store.csv what happens at that line.
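In Python 3 (where csv works on text streams and exceptions no longer have a .message attribute), the same diagnostic idea could be written as a small helper. The function name find_bad_line is just an illustration, not anything standard:

```python
import csv

def find_bad_line(lines):
    """Return (line_number, error) for the first line the csv module
    rejects, or (line_count, None) if everything parses cleanly."""
    reader = csv.reader(lines)
    linenumber = 0
    try:
        for _ in reader:
            linenumber += 1
    except csv.Error as e:
        return linenumber + 1, e
    return linenumber, None

# For a real file, pass an open handle (newline='' is what the
# csv docs recommend so stray '\r' characters are not translated):
# with open(r"C:\work\DATA\Raw_data\store.csv", newline="") as f:
#     print(find_bad_line(f))
```

A string containing a bare "\r" fed to csv.reader raises the same "new-line character seen in unquoted field" error shown in the question, so this pinpoints the offending line number.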

Gamboge answered 30/11, 2015 at 12:53
Sorry for the late response. I had a look at the csv; there were some characters like \r, -> etc. that led to unexpected escapes. Replacing them in the source did the trick. Your answer helped a lot in visualizing them. – Hardiness
For Python 3, just open the file with 'r', i.e. open('file.csv', 'r'), for this to work. – Rici
This problem can also be resolved by simply re-saving the file from Excel's menu in "csv" format. – Ivey

Pass the full path to the CSV file and specify the encoding:

Corpus = pd.read_csv(r"C:\Users\Dell\Desktop\Dataset.csv",encoding='latin-1')
Lorie answered 29/3, 2021 at 8:3

The problem comes from the format of the Excel file. Select "Save As" from the menu and change the format from xls to CSV; then it will work.

Barn answered 25/5, 2022 at 16:45

In my case, the solution was to specify encoding to utf-16 as per the following answer: https://mcmap.net/q/242413/-_csv-error-line-contains-nul-from-a-downloaded-csv

pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', encoding='utf-16')
Bridgeport answered 28/2, 2023 at 8:24
