Writing large Pandas Dataframes to CSV file in chunks
How do I write out large data files to a CSV file in chunks?

I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me.

I want to make things easier by making copies of these files with only the columns of interest, so I have smaller files to work with for post-processing. So I plan to read each file into a dataframe and then write it out to a CSV file.

I've been looking into reading large data files into a dataframe in chunks. However, I haven't been able to find anything on how to write the data out to a CSV file in chunks.

Here is what I'm trying now, but it doesn't append to the CSV file:

with open(os.path.join(folder, filename), 'r') as src:
    chunks = pd.read_csv(src, sep='\t', skiprows=(0, 1, 2), header=0, chunksize=1000)
    for chunk in chunks:
        chunk.to_csv(
            os.path.join(folder, new_folder, "new_file_" + filename),
            columns=['TIME', 'STUFF']
        )
Brassware answered 22/7, 2016 at 16:20 Comment(0)

Solution:

header = True
for chunk in chunks:
    chunk.to_csv(
        os.path.join(folder, new_folder, "new_file_" + filename),
        header=header, columns=['TIME', 'STUFF'], mode='a'
    )
    header = False

Notes:

  • mode='a' tells pandas to append to the file instead of overwriting it for each chunk.
  • The column header is written only with the first chunk; a full, runnable version of this pattern is sketched below.
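
Putting it together, a minimal sketch (not the poster's exact code; it assumes the tab-separated input, the skipped rows, and the folder/filename/new_folder variables from the question):

import os
import pandas as pd

src_path = os.path.join(folder, filename)
dst_path = os.path.join(folder, new_folder, "new_file_" + filename)

header = True
for chunk in pd.read_csv(src_path, sep='\t', skiprows=(0, 1, 2),
                         header=0, chunksize=1000):
    # append each chunk; write the column labels only once
    chunk.to_csv(dst_path, columns=['TIME', 'STUFF'],
                 header=header, mode='a', index=False)  # index=False is optional, drops the row index
    header = False
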
Cautery answered 22/7, 2016 at 16:27 Comment(4)
I've noticed that when I append using mode='a', the column labels are written after every chunk. How do I make sure column labels only appear at the beginning of the file? – Brassware
You can pass header=None to all but the first chunk – Monique
You could do for i, chunk in enumerate(chunks):, and then header=(i == 0) – Adeline
I wound up using this solution together with this other solution to split a dataframe into chunks first: #17316237 – Merriemerrielle
B
23

Check out the chunksize argument of the to_csv method; see the pandas documentation for details.

Writing to file would look like:

df.to_csv("path/to/save/file.csv", chunksize=1000, cols=['TIME','STUFF'])
Belia answered 22/7, 2016 at 16:27 Comment(5)
Hmm, I got the following error using your proposed method: AttributeError: 'TextFileReader' object has no attribute 'to_csv'. Your answer still assumes I'm reading into "df" in chunks? – Brassware
This is for a complete DataFrame. – Belia
This is not helpful when streaming a giant dataframe from one file to another; in that case mode='a' is better. – Dihybrid
@denfromufa Is that for sure? chunksize could mean writing in batches, could it not? And then it would have to be done in append mode anyway. Or am I missing something? I don't know the technical details, just a guess. Does anybody have more insight into this: is it the same as the accepted answer with its loop? – Rodolphe
I can confirm that on a 50 MB file with 700,000 rows and chunksize=5000 this was many times faster than a normal csv writer looping over batches. I have not checked the loop over dataframes in append mode as in the accepted answer, but this answer cannot be bad at least. It brought the Cloud Function time down to 62 s from the >9 min timeout limit before (I do not even know how long writing all the data would have taken, but much longer, obviously). – Rodolphe

Why don't you only read the columns of interest and then save it?

file_in = os.path.join(folder, filename)
file_out = os.path.join(folder, new_folder, 'new_file' + filename)

df = pd.read_csv(file_in, sep='\t', skiprows=(0, 1, 2), header=0, usecols=['TIME', 'STUFF'])
df.to_csv(file_out)
Additive answered 22/7, 2016 at 17:11 Comment(2)
This was just in case I come across files so big that I have to read them in chunks. I don't believe your code would allow me to do that, correct? – Brassware
Correct, but it is still much more efficient. If that were the case, you would still need to chunk or else use the csv module. – Additive
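
As the comment above notes, when a file is too large even to read with usecols, the standard-library csv module can stream it row by row in constant memory. A sketch of that approach (it reuses the folder/filename/new_folder variables, the tab delimiter, and the three skipped rows from the question):

import csv
import os

src_path = os.path.join(folder, filename)
dst_path = os.path.join(folder, new_folder, "new_file_" + filename)

with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
    for _ in range(3):
        next(src)                                 # skip the three leading rows
    reader = csv.DictReader(src, delimiter='\t')  # header row supplies the column names
    writer = csv.DictWriter(dst, fieldnames=['TIME', 'STUFF'], extrasaction='ignore')
    writer.writeheader()
    for row in reader:                            # one row at a time, constant memory
        writer.writerow(row)
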
