Writing large Pandas Dataframes to CSV file in chunks
How do I write out large data files to a CSV file in chunks?

I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me.

I want to make things easier by making copies of these files with only the columns of interest, so I have smaller files to work with for post-processing. So I plan to read each file into a dataframe and then write it out to a CSV file.

I've been looking into reading large data files into a dataframe in chunks. However, I haven't been able to find anything on how to write the data out to a CSV file in chunks.

Here is what I'm trying now, but it doesn't append to the CSV file:

with open(os.path.join(folder, filename), 'r') as src:
    chunks = pd.read_csv(src, sep='\t', skiprows=(0, 1, 2), header=0, chunksize=1000)
    for chunk in chunks:
        chunk.to_csv(
            os.path.join(folder, new_folder, "new_file_" + filename),
            columns=['TIME', 'STUFF']
        )
Brassware answered 22/7, 2016 at 16:20 Comment(0)

Solution:

header = True
for chunk in chunks:
    chunk.to_csv(
        os.path.join(folder, new_folder, "new_file_" + filename),
        header=header, columns=['TIME', 'STUFF'], mode='a'
    )
    header = False

Notes:

  • mode='a' tells pandas to append to the file instead of overwriting it for each chunk.
  • The column header is written only with the first chunk; a full, runnable version of this pattern is sketched below.
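
Putting it together, a minimal sketch (not the poster's exact code; it assumes the tab-separated input, the skipped rows, and the folder/filename/new_folder variables from the question):

import os
import pandas as pd

src_path = os.path.join(folder, filename)
dst_path = os.path.join(folder, new_folder, "new_file_" + filename)

header = True
for chunk in pd.read_csv(src_path, sep='\t', skiprows=(0, 1, 2),
                         header=0, chunksize=1000):
    # append each chunk; write the column labels only once
    chunk.to_csv(dst_path, columns=['TIME', 'STUFF'],
                 header=header, mode='a', index=False)  # index=False is optional, drops the row index
    header = False
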
Cautery answered 22/7, 2016 at 16:27 Comment(4)
I've noticed that when I append using mode='a', the column labels are written after every chunk. How do I make sure column labels only appear at the beginning of the file? – Brassware
You can pass header=None to all but the first chunk – Monique
You could do for i, chunk in enumerate(chunks):, and then header=(i == 0) – Adeline
I wound up using this solution together with this other solution to split a dataframe into chunks first: #17316237 – Merriemerrielle
B
23

Check out the chunksize argument of the to_csv method; see the pandas documentation for details.

Writing to file would look like:

df.to_csv("path/to/save/file.csv", chunksize=1000, cols=['TIME','STUFF'])
Belia answered 22/7, 2016 at 16:27 Comment(5)
Hmm, I got the following error using your proposed method: AttributeError: 'TextFileReader' object has no attribute 'to_csv'. Your answer still assumes I'm reading into "df" in chunks? – Brassware
This is for a complete DataFrame. – Belia
This is not helpful when streaming a giant dataframe from one file to another; in that case mode='a' is better. – Dihybrid
@denfromufa Is that for sure? chunksize could mean writing in batches, could it not? And then it would have to be done in append mode anyway. Or am I missing something? I don't know the technical details, just a guess. Does anybody have more insight into this: is it the same as the accepted answer with its loop? – Rodolphe
I can confirm that on a 50 MB file with 700,000 rows and chunksize=5000 this was many times faster than a normal csv writer looping over batches. I have not checked the loop over dataframes in append mode as in the accepted answer, but this answer cannot be bad at least. It brought the Cloud Function time down to 62 s from the >9 min timeout limit before (I do not even know how long writing all the data would have taken, but much longer, obviously). – Rodolphe

Why don't you only read the columns of interest and then save it?

file_in = os.path.join(folder, filename)
file_out = os.path.join(folder, new_folder, 'new_file' + filename)

df = pd.read_csv(file_in, sep='\t', skiprows=(0, 1, 2), header=0, usecols=['TIME', 'STUFF'])
df.to_csv(file_out)
Additive answered 22/7, 2016 at 17:11 Comment(2)
This was just in case I come across files so big that I have to read them in chunks. I don't believe your code would allow me to do that, correct? – Brassware
Correct, but it is still much more efficient. If that were the case, you would still need to chunk or else use the csv module. – Additive
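
As the comment above notes, when a file is too large even to read with usecols, the standard-library csv module can stream it row by row in constant memory. A sketch of that approach (it reuses the folder/filename/new_folder variables, the tab delimiter, and the three skipped rows from the question):

import csv
import os

src_path = os.path.join(folder, filename)
dst_path = os.path.join(folder, new_folder, "new_file_" + filename)

with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
    for _ in range(3):
        next(src)                                 # skip the three leading rows
    reader = csv.DictReader(src, delimiter='\t')  # header row supplies the column names
    writer = csv.DictWriter(dst, fieldnames=['TIME', 'STUFF'], extrasaction='ignore')
    writer.writeheader()
    for row in reader:                            # one row at a time, constant memory
        writer.writerow(row)
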
