Read multiple CSV files in Pandas in chunks

How do I import and read multiple CSV files in chunks when the total size of all the CSVs is around 20 GB?

I don't want to use Spark, because I want to use a model from scikit-learn, so I want the solution in Pandas itself.

My code is:

import glob
import os
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)

But this is failing, as the total size of all the CSVs in my path is 17 GB.

I want to read them in chunks, but I get an error when I try it like this:

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",", chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)

The error I get is this:

"cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"

Can someone help?

Impedimenta answered 4/3, 2019 at 16:38 Comment(6)
Have you tried df = pd.concat([pd.read_csv(f,sep=",",chunksize=10000) for f in allFiles]), i.e. with square brackets? I think the normal brackets give you a generator...Rabideau
@Rabideau No, that didn't help. Same error.Impedimenta
Looking at the documentation for pd.read_csv it looks like specifying the chunksize argument makes the method call return a TextFileReader object (rather than a dataframe) which has to be iterated over.Rabideau
If you don't have the necessary RAM to hold the result of all the chunks then it doesn't matter if you read it in chunks or not... you're still going to make your machine go boom...Bibbie
@JonClements - I have the RAM. It's just that it's making this read process very slow and bringing down performance.Impedimenta
Can you provide your hardware specs? How big is your RAM? How many CPUs/GPUs do you have?Easter

To read a large CSV file you could use chunksize, but in this case you have to use an iterator, like this:

for df in pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000):
    process(df)

You have to concat or append each chunk yourself.
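
For example, a minimal sketch that collects the chunks in a list and concatenates them once at the end ('file.csv' is a placeholder):

import pandas as pd

chunks = []
for chunk in pd.read_csv('file.csv', sep=',', chunksize=10000):
    chunks.append(chunk)  # optionally filter or transform each chunk first

# a single concat at the end is much cheaper than appending DataFrames repeatedly
df = pd.concat(chunks, ignore_index=True)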

Or you could do it like this:

df = pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000)
for chunk in df:
    process(chunk)

To read multiple files, for example:

listfile = ['file1', 'file2']
dfx = pd.DataFrame()

def process(d):
    global dfx
    dfx = pd.concat([dfx, d])  # or: dfx = dfx.append(d) on older pandas
    # other processing

for f in listfile:
    for df in pd.read_csv(f, sep=',', iterator=True, chunksize=10000):
        process(df)

Once you have a lot of files, you could use Dask or a Pool from the multiprocessing library to launch many reading processes in parallel.
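
As a rough illustration of the Pool idea (a sketch, not the original author's code: it reads whole files in parallel rather than in chunks, and the 'data' directory and worker count are placeholders):

import glob
import os
from multiprocessing import Pool

import pandas as pd

def read_one(path):
    # each worker process reads one file into a DataFrame
    return pd.read_csv(path, sep=',')

if __name__ == '__main__':
    allFiles = glob.glob(os.path.join('data', '*.csv'))  # 'data' is a placeholder path
    with Pool(processes=4) as pool:
        frames = pool.map(read_one, allFiles)
    df = pd.concat(frames, ignore_index=True)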

Anyway, either you have enough memory or you lose time.

Bobker answered 4/3, 2019 at 17:10 Comment(3)
This works fine if I am trying to read 1 CSV. But in my case I have a bunch of CSVs in my folder, so read_csv("file.csv") uses just 1 file here while I have a folder with a lot of CSV files. That is why I am concatenating them in my code example.Impedimenta
@Impedimenta I suggest you drop as much data unused for your machine learning as possible. I have modified my program to read multiple files.Bobker
Thanks, but the moment I try to concat the data I get the same error.Impedimenta

This is an interesting question. I haven't tried this, but I think the code would look something like the script below.

import pandas as pd
import glob
import os

#os.chdir("C:\\your_path\\")
filelist = glob.glob("C:\\your_path\\*.csv")
dfList = []
for filename in filelist:
    print(filename)
    namedf = pd.read_csv(filename, skiprows=0, index_col=0)
    dfList.append(namedf)  # collect the pieces and concat once (DataFrame.append was removed in pandas 2.0)
results = pd.concat(dfList)

results.to_csv('C:\\your_path\\Combinefile.csv')


chunksize = 10 ** 6
for chunk in pd.read_csv('C:\\your_path\\Combinefile.csv', chunksize=chunksize):
    process(chunk)

Maybe you could load everything into memory and process it directly, but it would probably take a lot longer to process everything.

Cathrinecathryn answered 11/3, 2019 at 18:20 Comment(0)

One way to do this is to chunk the DataFrame with pd.read_csv(file, chunksize=chunksize), and then, if the last chunk you read is shorter than the chunksize, save the extra bit and add it onto the first chunk of the next file.

Make sure to read a smaller first chunk from the next file so that the combined chunk equals the desired total chunk size.

import os
import pandas as pd

def chunk_from_files(dir, master_chunksize):
    '''
    Provided a directory, loops through files and chunks out dataframes.
    :param dir: Directory to csv files.
    :param master_chunksize: Size of chunk to output.
    :return: Dataframes with master_chunksize chunk.
    '''
    files = os.listdir(dir)

    chunksize = master_chunksize
    extra_chunk = None # Initialize the extra chunk.
    for file in files:
        csv_file = os.path.join(dir, file)

        # Alter chunksize if extra chunk is not None.
        if extra_chunk is not None:
            chunksize = master_chunksize - extra_chunk.shape[0]

        for chunk in pd.read_csv(csv_file, chunksize=chunksize):
            if extra_chunk is not None: 
                # Concatenate last small chunk of previous file with altered first chunk of next file.
                chunk = pd.concat([chunk, extra_chunk])
                extra_chunk = None
                chunksize = master_chunksize # Reset chunksize.
            elif chunk.shape[0] < chunksize:
                # If the last chunk is less than the chunk size, set it as the extra bit.
                extra_chunk = chunk
                break

            yield chunk
Cornemuse answered 8/4, 2020 at 22:52 Comment(1)
This does not work. The problem is that one cannot change chunksize after calling pd.read_csv. To see this, use print(f"actual size {chunk.shape[0]} expected size chunksize={chunksize}") after the loop over the chunks.Contexture

I am a bit late to answer, but maybe it helps others. This solution is actually inspired by another answer (https://mcmap.net/q/1841596/-read-multiple-csv-files-in-pandas-in-chunks), which does not work (at least not for me; see my comment there). All the other solutions (at the time of writing) require loading all files into memory, which is not satisfactory and might not be possible.

In the following we use iterator=True in pd.read_csv, which allows us to use get_chunk to retrieve chunks of a given chunksize.

import logging
import sys
import pandas as pd

LOG = logging.getLogger(__name__)


def chunk_from_files(files, chunksize):
    """Loop over files and return as much equally sized chunks as possible."""
    leftover_chunk = None

    for file in files:
        with pd.read_csv(file, iterator=True) as reader:
            lines_in_file = 0
            if leftover_chunk is not None:
                LOG.debug(
                    f"Combine leftover_chunk of size {leftover_chunk.shape[0]} "
                    "with chunk from new file of size "
                    f"{chunksize - leftover_chunk.shape[0]}"
                )
                chunk = pd.concat(
                    [
                        leftover_chunk,
                        reader.get_chunk(chunksize - leftover_chunk.shape[0]),
                    ]
                )
                lines_in_file += chunksize - leftover_chunk.shape[0]
                leftover_chunk = None
                LOG.debug(f"Yield combined chunk of size {chunk.shape[0]}")
                yield chunk

            while leftover_chunk is None:
                chunk = reader.get_chunk(chunksize)
                lines_in_file += chunk.shape[0]
                if chunk.shape[0] < chunksize:
                    leftover_chunk = chunk
                    LOG.debug(
                        f"Leftover chunk of size {leftover_chunk.shape[0]}"
                    )
                else:
                    LOG.debug(f"Yield regular chunk of size {chunk.shape[0]}")
                    yield chunk

            LOG.debug(f"Processed {lines_in_file} of file {file}")

    LOG.warning(f"Yield leftover_chunk of size {leftover_chunk.shape[0]}")
    yield leftover_chunk


for chunk in chunk_from_files(files, chunksize=1e6):
    print(chunk.shape[0])

This will return as many equally sized chunks as possible. Split chunks will be combined with chunks from the next file. Note that chunksize has to be smaller than the number of lines in a single file.

Note: the above was tested with pd.__version__ == '2.2.2'.
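
If, as in the question, the goal is to train a scikit-learn model on these chunks, a rough sketch is to feed them to an estimator that supports partial_fit; the 'target' label column, the class labels, and a purely numeric feature matrix are all assumptions here:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs all class labels up front; [0, 1] is an assumption

for chunk in chunk_from_files(files, chunksize=1e6):
    X = chunk.drop(columns=['target'])  # 'target' is a hypothetical label column
    y = chunk['target']
    clf.partial_fit(X, y, classes=classes)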

Contexture answered 9/10 at 8:22 Comment(0)
