What is the fastest way to output large DataFrame into a CSV file?

For Python/pandas I find that df.to_csv(fname) works at a speed of roughly 1 million rows per minute. I can sometimes improve performance by a factor of 7, like this:

import numpy as np
from pandas import DataFrame, read_csv

def df2csv(df, fname, myformats=[], sep=','):
  """
    # function is faster than to_csv
    # 7 times faster for numbers if formats are specified, 
    # 2 times faster for strings.
    # Note - be careful. It doesn't add quotes and doesn't check
    # for quotes or separators inside elements
    # We've seen output time going down from 45 min to 6 min 
    # on a simple numeric 4-col dataframe with 45 million rows.
  """
  if len(df.columns) <= 0:
    return
  Nd = len(df.columns)
  Nd_1 = Nd - 1
  formats = myformats[:] # take a copy to modify it
  Nf = len(formats)
  # make sure we have formats for all columns
  if Nf < Nd:
    for ii in range(Nf,Nd):
      coltype = df[df.columns[ii]].dtype
      ff = '%s'
      if coltype == np.int64:
        ff = '%d'
      elif coltype == np.float64:
        ff = '%f'
      formats.append(ff)
  fh=open(fname,'w')
  fh.write(','.join(df.columns) + '\n')
  for row in df.itertuples(index=False):
    ss = ''
    for ii in xrange(Nd):
      ss += formats[ii] % row[ii]
      if ii < Nd_1:
        ss += sep
    fh.write(ss+'\n')
  fh.close()

aa=DataFrame({'A':range(1000000)})
aa['B'] = aa.A + 1.0
aa['C'] = aa.A + 2.0
aa['D'] = aa.A + 3.0

timeit -r1 -n1 aa.to_csv('junk1')    # 52.9 sec
timeit -r1 -n1 df2csv(aa,'junk3',myformats=['%d','%.1f','%.1f','%.1f']) #  7.5 sec

Note: the increase in performance depends on dtypes, but it was always true (at least in my tests) that to_csv() performs much slower than non-optimized Python.

If I have a 45-million-row CSV file, then:

aa = read_csv(infile)  #  1.5 min
aa.to_csv(outfile)     # 45 min
df2csv(aa,...)         # ~6 min

Questions:

What are the ways to make the output even faster?
What's wrong with to_csv()? Why is it soooo slow?

Note: my tests were done using pandas 0.9.1 on a local drive on a Linux server.

Theresita answered 14/3, 2013 at 18:40 Comment(3)
I just recently used the built-in to_excel and then to_csv DataFrame methods to export about 1.7K reports in a batch from a DataFrame groupby, and a portion of the reports (in particular, the larger files) came out corrupted. I'm now rather suspicious of these built-in procedures and plan to home-brew my own exporting functions for my workflow. – Seduction
@DavidMarx would you be able to post an example of the corrupted file, the DataFrame, and your code? It would help us immensely to debug the issue. Thanks. – Cavill
It's in a workhorse tool I built for my office. I'll see if I can't duplicate the issue with data that isn't workplace-sensitive. – Seduction

Lev, pandas has rewritten to_csv to make a big improvement in native speed. The process is now I/O bound, and it accounts for many subtle dtype issues and quoting cases. Here are our performance results for the upcoming 0.11 release vs. 0.10.1. Times are in ms; a lower ratio is better.

Results:
                                            t_head  t_baseline      ratio
name                                                                     
frame_to_csv2 (100k rows)                 190.5260   2244.4260     0.0849
write_csv_standard  (10k rows)             38.1940    234.2570     0.1630
frame_to_csv_mixed  (10k rows, mixed)     369.0670   1123.0412     0.3286
frame_to_csv (3k rows, wide)              112.2720    226.7549     0.4951

So throughput for a single dtype (e.g. floats) on a frame that is not too wide is about 20M rows/min. Here is your example from above:

In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')}) 
In [13]: df['B'] = df['A'] + 1.0   
In [14]: df['C'] = df['A'] + 2.0
In [15]: df['D'] = df['A'] + 2.0
In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
1 loops, best of 1: 119 s per loop
Steger answered 19/3, 2013 at 16:38 Comment(3)
FWIW, for this type of dataset it often pays to store in HDF5, especially if you have any need to query; see pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables and pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (a sketch follows these comments). – Steger
I am trying to output a 100,000 * 100,000 dataframe and it's taking painstakingly long... =( – Cote
I combined 90+ CSV files, picking 4 specific columns from each. Only 3 of the files are over 200,000 KB; the rest average under 10,000 KB, and some have zero rows. The final to_csv takes 17 minutes and the resulting CSV is 200,000 KB. Does this time make sense? How can I optimize? @Steger – Emir
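
Regarding the HDF5 suggestion in the first comment above, here is a minimal sketch of writing and reading such a frame through HDF5 via to_hdf/read_hdf; the file name and key are illustrative, and PyTables needs to be installed.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1_000_000, dtype='float64')})
df['B'] = df['A'] + 1.0

# Write to a binary, column-typed HDF5 store instead of CSV.
# format='table' supports appending and on-disk queries; 'fixed' is faster to write.
df.to_hdf('data.h5', key='df', mode='w', format='table')

# Read it back.
roundtrip = pd.read_hdf('data.h5', key='df')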

In 2019 for cases like this, it may be better to just use numpy. Look at the timings:

aa.to_csv('pandas_to_csv', index=False)
# 6.47 s

df2csv(aa,'code_from_question', myformats=['%d','%.1f','%.1f','%.1f'])
# 4.59 s

from numpy import savetxt

savetxt(
    'numpy_savetxt', aa.values, fmt='%d,%.1f,%.1f,%.1f',
    header=','.join(aa.columns), comments=''
)
# 3.5 s

So you can cut the time by a factor of two using numpy. This, of course, comes at the cost of reduced flexibility (when compared to aa.to_csv).

Benchmarked with Python 3.7, pandas 0.23.4, numpy 1.15.2 (xrange was replaced by range to make the posted function from the question work in Python 3).

PS. If you need to include the index, savetxt will work fine - just pass df.reset_index().values and adjust the formatting string accordingly.
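
For example, a minimal sketch of that reset_index approach on a small two-column frame (the names and the format string are illustrative; use one specifier per column plus one for the index):

import numpy as np
import pandas as pd

aa = pd.DataFrame({'A': range(1000000)})
aa['B'] = aa.A + 1.0

# reset_index() turns the index into a regular first column, so it gets written too;
# the format string needs one extra specifier for it.
np.savetxt(
    'numpy_savetxt_with_index.csv', aa.reset_index().values,
    fmt='%d,%d,%.1f', header=','.join(['index'] + list(aa.columns)), comments=''
)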

2021 update: as pointed out in the comments, pandas performance has improved greatly. savetxt is still the fastest option, but only by a narrow margin: when benchmarked with pandas 1.3.0 and numpy 1.20.3, aa.to_csv() took 2.64 s while savetxt took 2.53 s. The code from the question (df2csv) took 2.98 s, making it the slowest option nowadays.

Your mileage may vary: the 2021 test was performed on an SSD with a very fast CPU, while in 2019 I was using an HDD and a slower CPU.

Inception answered 10/2, 2019 at 15:19 Comment(4)
I'm getting the following error: ValueError: fmt has wrong number of % formats: %d,%.1f,%.1f,%.1f – Orthotropous
Make sure you have as many format strings as columns. The answer works out of the box with the test case from the question; other dataframes will require adjustments. – Inception
For what it's worth, pandas to_csv isn't that far behind anymore (I'm on v0.23.4): timeit("aa.to_csv('temp.csv', index=False)", globals=globals(), number=5) gives 37.9 s, while timeit("np.savetxt('temp.csv', aa.values, fmt='%d,%.1f,%.1f,%.1f', header=','.join(aa.columns), comments='')", globals=globals(), number=5) gives 33.7 s. – Rawson
I updated the answer. I wonder how much of a difference SSD vs HDD makes. – Inception

Use chunksize. I have found that it makes a hell of a lot of difference. If you have memory to spare, pick a good chunksize (number of rows) that fits into memory, and then write once.
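
A minimal sketch of the idea, assuming a frame df already in memory (the file name and chunk size are just illustrative):

import numpy as np
import pandas as pd

# A large numeric frame to write out.
df = pd.DataFrame(np.random.rand(5000000, 4), columns=list('ABCD'))

# Write 1,000,000 rows at a time instead of relying on the small default chunk size.
df.to_csv('out.csv', index=False, chunksize=1000000)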

Cayes answered 13/1, 2016 at 11:59 Comment(4)
Is the default chunksize 1 or something? What chunksize do you use? It didn't seem to make a difference for me, even setting chunksize=500000. – Burkey
Found it, it might be 1... github.com/pandas-dev/pandas/blob/v0.20.3/pandas/io/formats/… – Burkey
@Burkey chunksize is only 1 if the number of columns > 100000; otherwise it is 100000 // number of columns. – Slug
Was trying to write 16 million rows to CSV. Used chunksize=1000000, which increased the write speed by a factor of 10. Using a c5.9xlarge EC2 instance. – Guizot

I had the same question earlier today. Using to_csv on my dataframe took 1 hr 27 min.

I found a package called pyarrow that reduced this to about 10 min. This seemed like the most straightforward solution to me.

To use:

#install with conda then import
import pyarrow as pa
import pyarrow.csv as csv

#convert format - "old_pd_dataframe" is your "aa".
new_pa_dataframe = pa.Table.from_pandas(old_pd_dataframe)

#write csv
csv.write_csv(new_pa_dataframe, 'output.csv')
Twandatwang answered 11/10, 2022 at 23:39 Comment(0)

Your df2csv function is very nice, except it makes a lot of assumptions and doesn't work for the general case.

If it works for you, that's good, but be aware that it is not a general solution. CSV values can contain commas, so what happens if this tuple has to be written: ('a,b', 'c')?

The Python csv module would quote that value so that no confusion arises, and would escape quotes if quotes are present in any of the values. Of course, generating something that works in all cases is much slower. But I suppose you only have a bunch of numbers.

You could try this and see if it is faster:

# data is a sequence of tuples of numbers; f is a file object opened for writing

for row in data:
    for col in xrange(len(row)):
        f.write('%d' % row[col])
        if col < len(row) - 1:
            f.write(',')
    f.write('\n')

I don't know if that would be faster. If not, it's because too many write calls are made, so you might write into a StringIO instead of the file directly and dump it to a real file every once in a while.
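
For reference, a minimal sketch of both ideas: the csv module quoting the awkward ('a,b', 'c') tuple correctly, and buffering rows in a StringIO before flushing to disk in large blocks (file names and the flush interval are illustrative):

import csv
import io

rows = [('a,b', 'c'), ('plain', 'values')]

# csv.writer quotes the first field, so the embedded comma stays unambiguous: "a,b",c
with open('quoted.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Buffer formatted lines in memory and flush in blocks to reduce the number of
# write calls hitting the underlying file.
buf = io.StringIO()
with open('buffered.csv', 'w') as f:
    for i, row in enumerate(rows, 1):
        buf.write(','.join(str(x) for x in row) + '\n')
        if i % 100000 == 0:   # flush every 100k rows
            f.write(buf.getvalue())
            buf = io.StringIO()
    f.write(buf.getvalue())   # flush whatever is left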

Botanize answered 15/3, 2013 at 10:25 Comment(0)
