What is the fastest way to load a big CSV file in a notebook to work with Python pandas?

I'm trying to load a CSV file of about 250 MB: roughly 4 million rows and 6 columns of time-series data at 1-minute frequency. My usual procedure is:

import pandas as pd

location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
df = pd.read_csv(location)

This procedure takes about 20 minutes! Very preliminarily, I have explored a few options (pickle, HDF5, feather, reading in chunks).

I wonder if anybody has compared these options (or others) and whether there is a clear winner. If nobody answers, I will post my own results in the future; I just don't have time right now.

Genovevagenre answered 3/5, 2016 at 17:6 Comment(10)
It would help to understand the structure of your data; sometimes specifying the dtypes upfront helps with read_csv (a sketch of this follows these comments). – Lapides
This appears to be a duplicate. – Acceptation
On HDF5 vs. read_csv() speed and space needs, see #16640377. – Gordan
@SteveCohen there isn't an answer that compares pickle, HDF5, feather, and chunking in one example. It would be a good reference question to have on the site. – Genovevagenre
With the exception of feather, the others appear to be covered. Be that as it may, I suspect your problem may be deeper than Python. Is the file actually being copied over a network that may be throttling the data? Or is the machine running this code memory-starved and therefore swapping/thrashing? It seems unreasonable for any of those methods to take 20 minutes to load 250 MB of data. – Acceptation
I would consider only HDF5 and feather (it might be tricky and painful to build feather for Windows); the rest will most definitely be much slower. – Sarcocarp
Don't forget about Blaze: blaze.readthedocs.io/en/latest/index.html – Histiocyte
Is there a reason you can't gzip it? – Kowtow
@KenKinder, I've added gzip to my comparison. – Sarcocarp
There are other alternatives: Parquet, ASDF, TileDB... – Lineal
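A minimal sketch of what the first comment above suggests: declare column types up front so read_csv does not have to infer them while parsing. The column names and dtypes here are hypothetical; adjust them to the real file.

import pandas as pd

location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
df = pd.read_csv(
    location,
    parse_dates=['timestamp'],      # hypothetical datetime column
    dtype={'price': 'float32',      # hypothetical numeric columns
           'volume': 'int32'},
)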

Here are the results of my read and write comparison for the DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).

Comparison of the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (various compression settings):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

reading (bar chart in the original post; the numbers are in the table above and the raw timings below)

writing/saving (bar chart in the original post)

file size ratio in relation to the uncompressed CSV file (bar chart in the original post)

RAW DATA:

CSV:

In [68]: %timeit df.to_csv(fcsv)
1 loop, best of 3: 1min 9s per loop

In [74]: %timeit pd.read_csv(fcsv)
1 loop, best of 3: 17.9 s per loop

CSV.gzip:

In [70]: %timeit df.to_csv(fcsv_gz, compression='gzip')
1 loop, best of 3: 3min 6s per loop

In [75]: %timeit pd.read_csv(fcsv_gz)
1 loop, best of 3: 18.9 s per loop

Pickle:

In [66]: %timeit df.to_pickle(fpckl)
1 loop, best of 3: 1.77 s per loop

In [72]: %timeit pd.read_pickle(fpckl)
10 loops, best of 3: 173 ms per loop

HDF (format='fixed') [Default]:

In [67]: %timeit df.to_hdf(fh5, 'df')
1 loop, best of 3: 2.03 s per loop

In [73]: %timeit pd.read_hdf(fh5, 'df')
10 loops, best of 3: 196 ms per loop

HDF (format='table'):

In [37]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df', format='t')
1 loop, best of 3: 2.6 s per loop

In [38]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab.h5', 'df')
1 loop, best of 3: 230 ms per loop

HDF (format='table', complib='zlib', complevel=5):

In [40]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df', format='t', complevel=5, complib='zlib')
1 loop, best of 3: 5.44 s per loop

In [41]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib5.h5', 'df')
1 loop, best of 3: 854 ms per loop

HDF (format='table', complib='zlib', complevel=9):

In [36]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df', format='t', complevel=9, complib='zlib')
1 loop, best of 3: 5.95 s per loop

In [39]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_zlib9.h5', 'df')
1 loop, best of 3: 860 ms per loop

HDF (format='table', complib='bzip2', complevel=5):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df', format='t', complevel=5, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l5.h5', 'df')
1 loop, best of 3: 2.5 s per loop

HDF (format='table', complib='bzip2', complevel=9):

In [42]: %timeit df.to_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df', format='t', complevel=9, complib='bzip2')
1 loop, best of 3: 36.5 s per loop

In [43]: %timeit pd.read_hdf('D:\\temp\\.data\\37010212_tab_compress_bzip2_l9.h5', 'df')
1 loop, best of 3: 2.5 s per loop

PS: I can't test feather on my Windows notebook.

DF info:

In [49]: df.shape
Out[49]: (4000000, 6)

In [50]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 6 columns):
a    datetime64[ns]
b    datetime64[ns]
c    datetime64[ns]
d    datetime64[ns]
e    datetime64[ns]
f    datetime64[ns]
dtypes: datetime64[ns](6)
memory usage: 183.1 MB

In [41]: df.head()
Out[41]:
                    a                   b                   c  \
0 1970-01-01 00:00:00 1970-01-01 00:01:00 1970-01-01 00:02:00
1 1970-01-01 00:01:00 1970-01-01 00:02:00 1970-01-01 00:03:00
2 1970-01-01 00:02:00 1970-01-01 00:03:00 1970-01-01 00:04:00
3 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
4 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00

                    d                   e                   f
0 1970-01-01 00:03:00 1970-01-01 00:04:00 1970-01-01 00:05:00
1 1970-01-01 00:04:00 1970-01-01 00:05:00 1970-01-01 00:06:00
2 1970-01-01 00:05:00 1970-01-01 00:06:00 1970-01-01 00:07:00
3 1970-01-01 00:06:00 1970-01-01 00:07:00 1970-01-01 00:08:00
4 1970-01-01 00:07:00 1970-01-01 00:08:00 1970-01-01 00:09:00
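For anyone who wants to reproduce the comparison, a frame with the same shape, dtypes, and 1-minute offsets as shown above can be built roughly like this (a sketch, not necessarily the original generator):

import pandas as pd

n = 4_000_000
base = pd.date_range('1970-01-01', periods=n, freq='min')   # 1-minute frequency
df = pd.DataFrame({col: base + pd.Timedelta(minutes=i)      # columns a..f shifted by i minutes
                   for i, col in enumerate('abcdef')})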

File sizes:

{ .data }  » ls -lh 37010212.*                                                                          /d/temp/.data
-rw-r--r-- 1 Max None 492M May  3 22:21 37010212.csv
-rw-r--r-- 1 Max None  23M May  3 22:19 37010212.csv.gz
-rw-r--r-- 1 Max None 214M May  3 22:02 37010212.h5
-rw-r--r-- 1 Max None 184M May  3 22:02 37010212.pickle
-rw-r--r-- 1 Max None 215M May  4 10:39 37010212_tab.h5
-rw-r--r-- 1 Max None 5.4M May  4 10:46 37010212_tab_compress_bzip2_l5.h5
-rw-r--r-- 1 Max None 5.4M May  4 10:51 37010212_tab_compress_bzip2_l9.h5
-rw-r--r-- 1 Max None  17M May  4 10:42 37010212_tab_compress_zlib5.h5
-rw-r--r-- 1 Max None  17M May  4 10:36 37010212_tab_compress_zlib9.h5

Conclusion:

Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames in a single file, you can read your data conditionally (see the where parameter of read_hdf()), and you can store your data compressed (zlib is faster; bzip2 provides a better compression ratio), etc.
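For example, a minimal sketch of a conditional read (the file name is for illustration only; it assumes the frame above is written with format='table' and column 'a' declared as a data column so it can be queried):

df.to_hdf('data.h5', key='df', format='table', data_columns=['a'],
          complib='zlib', complevel=5)
subset = pd.read_hdf('data.h5', key='df',
                     where="a < '1970-01-01 01:00:00'")   # read only the matching rows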

PS: if you can build/use feather-format, it should be even faster than HDF5 and Pickle.

PPS: don't use Pickle for very big data frames, as you may end up with a SystemError: error return without exception set error message. This is also described here and here.

Sarcocarp answered 3/5, 2016 at 18:45 Comment(3)
You might want to try cPickle.dump(..., protocol=2) as in matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization. In my tests it was 3x faster than the HDF fixed format (a sketch follows these comments). – Eun
Read and write times for ASCII tables are misleading. There are much faster ways to read/write ASCII files when a few reasonable assumptions are imposed. See PyArrow or Astropy Fast ASCII, and others. – Kelleykelli
One can store a subclass of DataFrame or Series in HDF5, but the type of the subclass is lost upon storing. – Cockrell
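A minimal sketch of the protocol suggestion in the first comment; in Python 3 cPickle is simply pickle, and DataFrame.to_pickle also accepts a protocol argument in current pandas versions:

import pickle

with open('df.pkl', 'wb') as f:
    pickle.dump(df, f, protocol=2)   # fix the pickle protocol explicitly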

You can use datatable to load the CSV file and then convert it to a pandas DataFrame; it is considerably faster than loading with pandas read_csv().

import datatable as dt

dt_df = dt.fread(csv_file)    # csv_file is the path to your CSV file
pd_df = dt_df.to_pandas()

On a CSV file of 1 GB, pandas read_csv takes about 34 minutes, while datatable's fread takes only 40 seconds, which is a huge difference (roughly 51x faster).

You can also work directly with the datatable frame, without converting it to a pandas DataFrame (this depends on the functionality you need). A useful introduction to datatable: https://www.kaggle.com/sudalairajkumar/getting-started-with-python-datatable

Kingsly answered 22/7, 2020 at 12:9 Comment(6)
Hm. I just tried this and it's still not as fast as reading the CSV one line at a time. I have a file with 11,000 lines and a dozen or so columns. My raw file read benchmarks at 0.16 seconds vs. 9.7 seconds for this datatable route. I don't get it. If CSV is this fast, what am I missing regarding other storage schemes? – Symbology
It depends on two things: 1) what your data looks like, and 2) what you will do after reading it. 1) You can read the CSV file manually, but then you have to deal with type conversion per column, handle empty entries … and store the result somewhere, in a matrix for example. 2) If you want to manipulate your data with a pandas DataFrame (or datatable), you have to convert the data, and that conversion time also has to be considered. By the way, there is a csv module in Python you can use (a sketch follows these comments). So if the csv module is enough, go for it ;), good luck. – Kingsly
Maybe that's the most important point in all of this. If a line-by-line read is in fact all I need, rather than an in-program representation of the data cached in memory, then I can get by with that. Maybe I should avoid HDF5 and other solutions unless I need an in-memory cache of the data. Is it correct to assume part of the overhead I am experiencing is due to the fact that the other solutions are preparing the data to be stored in the program's data space? – Symbology
Still, I feel like I am missing something about HDF5 and other hierarchical binary format solutions. If I can use slicing and chunking to improve my indexing criteria (say, for a particular time period within the data), shouldn't I be able to iterate over the binary file as fast as a CSV file (i.e., avoid the loading-into-memory step)? Because ultimately, I could improve upon CSV with my own binary format and my own serializer/deserializer code. It would provide a better storage footprint and make use of slicing via a well-defined file structure, but that is reinventing the wheel that is HDF5. – Symbology
@LeanMan, I think the point you're missing is the distinction between 1) just reading a CSV file (where you have to deal with type conversion yourself) and 2) libraries for data manipulation (statistics and analysis) like pandas and others; these libraries come with their own CSV readers that can guess the types. Dask is in the second category: it is a collection of smaller pandas DataFrames for big data (Dask provides advanced parallelism for analytics). – Kingsly
I do intend to do statistics and analysis on my data, but since most of the analysis is moving averages, it doesn't require the "load to memory" time that pandas/HDF5 incurs. I can get by with reading only the latest data as time moves on and data is collected at the end of each period. – Symbology
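For reference, this is what the commenters mean by reading the CSV manually with Python's standard csv module (a sketch; all values arrive as strings, so any type conversion is up to you):

import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)        # first row: column names
    for row in reader:           # each subsequent row is a list of strings
        pass                     # convert types / accumulate results here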

If you are really looking for a quick solution, I'm surprised no one has suggested chunking the data and reading it in batches. I haven't worked on this recently, but at my previous job I had to work through a file with ~30M rows pretty frequently. I would just use the chunksize parameter when calling read_csv and concatenate the chunks to form the main DataFrame. You can also pass iterator=True in the read_csv parameters.

It never took more than a minute to read for me (though it depends on the chunksize and system configuration).

df_csv = pd.read_csv('file', sep='\t', iterator=True, chunksize=1000)  # iterator of DataFrame chunks
df = pd.concat(chunk for chunk in df_csv)                              # stitch the chunks together

Additionally, you can filter the data in each chunk if that's what you want to do next, which makes life a lot easier:

df = pd.concat(chunk[chunk['col'] > 100] for chunk in df_csv)
Magnanimous answered 8/7, 2024 at 21:28 Comment(0)
