How do I import and read multiple CSV files in chunks when the total size of all the CSVs is around 20 GB?
I don't want to use Spark, because I want to use a model from scikit-learn, so I need the solution in Pandas
itself.
My code is:
import glob
import os
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat(pd.read_csv(f, sep=",") for f in allFiles)
df.reset_index(drop=True, inplace=True)
But this fails because the total size of all the CSVs in my path is 17 GB, so the concatenated DataFrame does not fit in memory.
I want to read the data in chunks, but I get an error if I try it like this:
allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f,sep=",",chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)
The error I get is this:
"cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"
Can someone help?
df = pd.concat([pd.read_csv(f, sep=",", chunksize=10000) for f in allFiles]), i.e. with square brackets? I think the normal brackets give you a generator... – Rabideau
For pd.read_csv, it looks like specifying the chunksize argument makes the method call return a TextFileReader object (rather than a DataFrame), which has to be iterated over. – Rabideau
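A minimal sketch of what iterating over those TextFileReader objects could look like, so the 17 GB never has to be concatenated into a single DataFrame. It assumes an estimator that supports incremental learning via partial_fit (SGDClassifier here), a label column named "target", and known class labels; those names, the path, and the chunk size are placeholders, not part of the original question.

import glob
import os

import pandas as pd
from sklearn.linear_model import SGDClassifier  # assumption: any estimator with partial_fit would work

path = "data"  # assumption: directory holding the CSV files
allFiles = glob.glob(os.path.join(path, "*.csv"))

model = SGDClassifier()
classes = [0, 1]  # assumption: the full set of label values, required on the first partial_fit call

for f in allFiles:
    # With chunksize set, read_csv returns a TextFileReader that yields DataFrames of up to 10000 rows
    for chunk in pd.read_csv(f, sep=",", chunksize=10000):
        X = chunk.drop(columns=["target"])  # assumption: the label column is named "target"
        y = chunk["target"]
        model.partial_fit(X, y, classes=classes)

If a single DataFrame really is needed, the chunks can be flattened and concatenated, e.g. pd.concat(chunk for f in allFiles for chunk in pd.read_csv(f, sep=",", chunksize=10000)), but that still materializes all 17 GB in memory; processing chunk by chunk is usually the point of chunksize.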