How to reversibly store and load a Pandas dataframe to/from disk

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?

Pastis answered 13/6, 2013 at 23:5 Comment(1)
Yeah, this is one of my major complaints about using Python - there's no simple way to save and retrieve data frames. R and SAS are far more user-friendly in this respect.Metalware

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

import pandas as pd
store = pd.HDFStore('store.h5')

store['df'] = df  # save it
store['df']  # load it

More advanced strategies are discussed in the cookbook.
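
pandas also exposes the to_hdf/read_hdf convenience wrappers, which avoid managing the store object yourself; a minimal sketch of the same round trip (assuming the PyTables package is installed):

df.to_hdf('store.h5', key='df', mode='w')  # write the frame
df = pd.read_hdf('store.h5', 'df')         # read it back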


Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).

Lenette answered 13/6, 2013 at 23:13 Comment(11)
@geekazoid In case the data needs to be transformed after loading (e.g. string/object to datetime64), this would need to be done again after loading a saved csv, resulting in performance loss. pickle saves the dataframe in its current state, so the data and its format are preserved. This can lead to massive performance increases.Energize
Both pickle and HDFStore cannot save a dataframe larger than 8GB. Are there alternatives?Rimmer
@Rimmer are you sure? That sounds implausible.Lenette
I just tried it; I am not sure how I can post a minimal reproducible example. My dataframe had a column for text and another column for a vector representation (numpy.array) of the text. Roughly 0.5 million rows and 8GB according to sys.getsizeof.Rimmer
@Rimmer try to generate it from random data (text and arrays) and post a new question. I don't think this can be right / I suspect we're missing something. A new question will get more eyes, but try to include/generate a DataFrame that reproduces the issue :)Lenette
What's the file permission mode of files created by to_pickle? This function seems to not take mode as an argument.Goody
@YixingLiu you can change the mode after the fact https://mcmap.net/q/81249/-changing-file-permission-in-pythonLenette
If you try to read a pickle file that was saved by a former version of pandas, it will produce ImportError: No module named 'pandas.core.internals.managers'; 'pandas.core.internals' is not a package. I reverted to an earlier version (23.0) as recommended by another thread but that did not solve the problem.Growler
Check this from pandas documentation: pandas.pydata.org/pandas-docs/stable/user_guide/io.html They keep updating with latest benchmarks from pandas dev team.Dunkin
Just an update for this reply. The HDFStore method is renamed to to_hdf since Pandas v1.2.Upcast
This raised an error TypeError: 'method' object does not support item assignment.Maisiemaison

Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

  • pickle: original ASCII data format
  • cPickle, a C library
  • pickle-p2: uses the newer binary format
  • json: standardlib json library
  • json-no-index: like json, but without index
  • msgpack: binary JSON alternative
  • CSV
  • hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself

The source code for the test they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

time comparison results

They also mention that converting text data to categorical data makes the serialization much faster; in their test, about 10 times as fast (also see the test code).

Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.
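
To see the size difference yourself, here is a small illustrative sketch (not part of the original benchmark) comparing protocol 0 with the binary protocol 2:

import pickle
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(100000)})
print(len(pickle.dumps(df, protocol=0)))  # printable ASCII representation: much larger
print(len(pickle.dumps(df, protocol=2)))  # binary protocol 2: far more compact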


Richrichara answered 6/11, 2015 at 15:24 Comment(8)
Hm, if I read this correctly, it says that pickle, a binary format, is slower than csv for numeric data... The sizes of the .csv files for the same data should be larger, right? Or is this cost included in the .py? (But again, what was the cache... given a large enough cache everything might have gone to memory...)Payload
I updated my answer to explain your question. To summarize: by default pickle stores data in an ASCII format.Richrichara
Ah, thanks for that explanation! As a note, pandas DataFrame .to_pickle seems to be using pkl.HIGHEST_PROTOCOL (should be 2)Payload
It seems the blog linked above (Efficiently Store Pandas DataFrames) has been deleted. I did my own comparisons with .to_pickle() (which uses binary storage) against .to_hdf() (without compression). The goal was speed; the file size for HDF was 11x Pickle, and the time to load was 5x Pickle. My data was ~5k files of ~7k rows x 6 cols each, mostly numeric.Operetta
The page still exists, you just need to remove the trailing slash: Efficiently Store Pandas DataFramesPhenyl
@Operetta Are you saying that HDF is faster or slower than pickle? I need speed at the moment, and everything I've read suggests HDF is faster, including this answer. But you're saying it is 11x, which I'm interpreting as HDF taking 11x longer. Or do you mean it's 11x faster (taking 1/11th the time pickle does)?Gullett
@Mike Williamson, in my test, pickle was 5x faster to load than HDF and also took 1/11 the disk space (i.e. HDF was 11x larger on disk and took 5x as much time to load from disk as pickle did). This was all on Python 3 with pandas 0.22.0.Operetta
@Operetta Thanks for the clarification! That's what I thought you meant, but it seems against the whole idea of HDF5 as this super-optimized data storage format. :(Gullett

If I understand correctly, you're already using pandas.read_csv() but would like to speed up the development process so that you don't have to load the file every time you edit your script, is that right? I have a few recommendations:

  1. you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table, while you're doing the development

  2. use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.

  3. convert the csv to an HDF5 table

  4. updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format, which is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data); see the sketch below.
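
A minimal sketch of recommendations 1 and 4 together (the file names here are hypothetical, and feather requires the pyarrow package):

import pandas as pd

# while developing, parse only the first 1,000 rows of the big CSV
df = pd.read_csv('big_input.csv', nrows=1000)

# once the script works, cache the full frame in feather format once...
pd.read_csv('big_input.csv').to_feather('cached_df.feather')

# ...and reload it on later runs, which is much faster than re-parsing the CSV
df = pd.read_feather('cached_df.feather')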

You might also be interested in this answer on stackoverflow.

Stinker answered 13/6, 2013 at 23:28 Comment(2)
Do you know why to_feather would work well on string data? I benchmarked to_pickle and to_feather on my numeric dataframe and pickle is about 3x faster.Flagon
@Flagon good question, I honestly haven't played much with the feather stuff, so I don't have an answerStinker

Pickle works well!

import pandas as pd
df.to_pickle('123.pkl')    #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df
Tetratomic answered 10/10, 2015 at 4:52 Comment(1)
Note that the files generated are not csv files; maybe it's better to use the extension .pkl as suggested in @Andy Hayden's answer.Richrichara

You can use feather format file. It is extremely fast.

df.to_feather('filename.ft')
Gradate answered 17/10, 2017 at 6:28 Comment(1)
And the data can then be used directly by R using the feather library.Aurangzeb

As already mentioned, there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

  1. pickle is a potential security risk. From the Python documentation for pickle:

Warning The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

  2. pickle is slow. Benchmarks can be found here and here.

  3. pickle is Python-only. You cannot simply read the results with other tools or programming languages.

Depending on your setup/usage, these limitations may not apply, but I would not recommend pickle as the default persistence format for pandas data frames.
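
For a portable alternative to pickle, a minimal parquet round trip looks like this (a sketch only; it requires pyarrow or fastparquet to be installed):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.to_parquet('df.parquet')         # columnar format readable from R, Spark and other tools
df = pd.read_parquet('df.parquet')  # load it back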

Camarillo answered 11/4, 2019 at 6:31 Comment(0)

Pandas DataFrames have the to_pickle function which is useful for saving a DataFrame:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False
Rodrigues answered 12/11, 2015 at 21:46 Comment(0)

Numpy file formats are pretty fast for numerical data

I prefer to use numpy files since they're fast and easy to work with. Here's a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

using ipython's %%timeit magic function

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

the output is

100 loops, best of 3: 5.97 ms per loop

to load the data back into a dataframe

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

the output is

100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS

There's a problem if you save the numpy file using python 2 and then try opening using python 3 (or vice versa).
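
As noted in the comments, saving the raw values this way also drops the column names. One workaround (a sketch assuming numeric columns, reusing num_df from above) is to save a structured record array instead:

rec = num_df.to_records(index=False)            # structured array keeps column names and dtypes
np.save('num_rec.npy', rec)
data_df = pd.DataFrame(np.load('num_rec.npy'))  # columns are rebuilt from the field names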

Generator answered 13/10, 2017 at 18:25 Comment(1)
note that this solution will delete all of your column names and change all of your integer data to float :(Arronarrondissement

Another quite fresh test with to_pickle().

I have 25 .csv files in total to process and the final dataframe consists of roughly 2M items.

(Note: Besides loading the .csv files, I also manipulate some data and extend the data frame by new columns.)

Going through all 25 .csv files and creating the dataframe takes around 14 seconds.

Loading the whole dataframe from a pkl file takes less than 1 second.
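
A minimal sketch of that caching pattern (the paths are hypothetical):

import glob
import os
import pandas as pd

CACHE = 'combined.pkl'
if os.path.exists(CACHE):
    df = pd.read_pickle(CACHE)  # fast path: load the cached frame in under a second
else:
    csv_files = glob.glob('data/*.csv')
    df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
    # ... manipulate data / add new columns here ...
    df.to_pickle(CACHE)         # slow path, paid only once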

Cimex answered 13/8, 2020 at 7:52 Comment(0)

https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
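
If you need a specific protocol (for example, for compatibility with an older Python), to_pickle accepts a protocol argument; a small sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df.to_pickle('df.pkl', protocol=4)  # pin the pickle protocol explicitly
df = pd.read_pickle('df.pkl')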

Oxysalt answered 12/4, 2019 at 6:18 Comment(0)

Arctic is a high-performance datastore for Pandas, numpy and other numeric data. It sits on top of MongoDB. Perhaps overkill for the OP, but worth mentioning for other folks stumbling across this post.

Arevalo answered 26/9, 2020 at 9:47 Comment(0)

pyarrow compatibility across versions

The overall move has been to pyarrow/feather (deprecation warnings from pandas for msgpack). However, I have a challenge with pyarrow being transient in its specification: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I'm using serialization for Redis, so I have to use a binary encoding.

I've retested various options (using jupyter notebook)

import sys, pickle, zlib, warnings, io
import pyarrow as pa  # needed for foocls.pyarrow below; `out` is the DataFrame under test, defined earlier in the notebook
class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally: 
        if sbreak: print("=+=" * 30)        
warnings.filterwarnings("default")

With the following results for my data frame (in the out jupyter variable):

pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

feather and parquet do not work for my data frame. I'm going to continue using pyarrow. However, I will supplement it with pickle (no compression). When writing to the cache, I store both the pyarrow and pickle serialised forms; when reading from the cache, I fall back to pickle if pyarrow deserialisation fails.

Phenoxide answered 29/2, 2020 at 14:3 Comment(1)
This doesn't answer the questionIndependence
import pickle

example_dict = {1: "6", 2: "2", 3: "g"}

with open("dict.pickle", "wb") as pickle_out:
    pickle.dump(example_dict, pickle_out)

The above code will save the pickle file

with open("dict.pickle", "rb") as pickle_in:
    example_dict = pickle.load(pickle_in)

These two lines will load the saved pickle file

Stinson answered 16/1, 2018 at 17:20 Comment(0)

A lot of great and sufficient answers here, but I would like to publish a test that I used on Kaggle, in which a large df is saved and read using different pandas-compatible formats:

https://www.kaggle.com/pedrocouto39/fast-reading-w-pickle-feather-parquet-jay

I'm not the author or a friend of the author of this; however, when I read this question I thought it was worth mentioning here.

  • CSV: 1 min 42 s
  • Pickle: 4.45 s
  • Feather: 4.35 s
  • Parquet: 8.31 s
  • Jay: 8.12 ms or 0.0812 s (blazing fast!)

Twosided answered 11/3, 2021 at 10:34 Comment(1)
Careful! That comparison is not fair! The Jay file is read as a datatable Frame instead of a pandas DataFrame. Datatable supports out-of-memory datasets and I suspect that the data is not actually read yet.Peralta
