Pandas msgpack vs pickle

msgpack in Pandas is supposed to be a replacement for pickle.

Per the Pandas docs on msgpack:

This is a lightweight portable binary format, similar to binary JSON, that is highly space efficient, and provides good performance both on the writing (serialization), and reading (deserialization).

I find, however, that its performance does not appear to stack up against pickle.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 100))

>>> %timeit df.to_pickle('test.p')
10 loops, best of 3: 22.4 ms per loop

>>> %timeit df.to_msgpack('test.msg')
10 loops, best of 3: 36.4 ms per loop

>>> %timeit pd.read_pickle('test.p')
100 loops, best of 3: 10.5 ms per loop

>>> %timeit pd.read_msgpack('test.msg')
10 loops, best of 3: 24.6 ms per loop

Question: Aside from the potential security issues with pickle, what are the benefits of msgpack over pickle? Is pickle still the preferred method of serializing data, or do better alternatives currently exist?

Chas answered 4/6, 2015 at 18:43 Comment(5)
Check out this pretty comprehensive study: matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization. msgpack is quite awesome when you have a non-trivial amount of data. – Godchild
That blog page no longer exists. – Hawker
@JasonS Try: matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization. Oddly, it is just a trailing / in the top link. – Chas
I wrote msgpickle. It's a fast, safe pickler that's easily extensible. It combines msgpack with a simple object pickler. – Bausch
@ErikAronesty Thanks for mentioning it. For others, here is the link on PyPI. It requires Python >= 3.11. – Chas

Pickle is better for the following:

  1. Numerical data or anything that uses the buffer protocol, such as numpy arrays (though only if you pass a reasonably recent pickle protocol via the protocol= argument)
  2. Python-specific objects like classes, functions, etc. (although here you should also look at cloudpickle)
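
To illustrate the point about the protocol, here is a rough sketch using only the standard library. It assumes `array.array` as a stand-in for a numpy array (both expose their data as a raw buffer), and the names `values` and `dump_size_and_time` are made up for this example. It compares the old text protocol 0 with the highest protocol available; exact numbers will vary by machine and Python version.

```python
import array
import pickle
import random
import timeit

# Hypothetical stand-in for a numeric column: 100,000 float64 values.
values = array.array("d", (random.random() for _ in range(100_000)))


def dump_size_and_time(obj, protocol, repeats=5):
    """Return (payload size in bytes, best seconds for one dumps call)."""
    payload = pickle.dumps(obj, protocol=protocol)
    best = min(
        timeit.repeat(
            lambda: pickle.dumps(obj, protocol=protocol),
            number=1,
            repeat=repeats,
        )
    )
    return len(payload), best


# Protocol 0 writes every float as text; newer protocols store raw bytes.
text_size, text_time = dump_size_and_time(values, protocol=0)
binary_size, binary_time = dump_size_and_time(
    values, protocol=pickle.HIGHEST_PROTOCOL
)

print(f"protocol 0: {text_size} bytes, {text_time:.4f} s")
print(f"protocol {pickle.HIGHEST_PROTOCOL}: {binary_size} bytes, {binary_time:.4f} s")
```

On numeric data like this, the binary protocol should produce a much smaller payload, which is why the protocol choice matters for the comparison in the question.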

MsgPack is better for the following:

  1. Cross-language interoperability. It is an alternative to JSON with some improvements
  2. Performance on text data and plain Python objects. It is a decent factor faster than pickle here, regardless of which pickle protocol you choose.
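
As a minimal sketch of the interoperability point, the snippet below round-trips a plain record through the standalone `msgpack` package (a third-party install, not pandas' own `to_msgpack`) and through pickle. The `record` contents and the helper names are invented for illustration, and the import is guarded so the pickle half still runs if `msgpack` is not installed.

```python
import pickle

try:
    import msgpack  # third-party package: pip install msgpack
except ImportError:
    msgpack = None

# Hypothetical record made of plain types that any msgpack client can read.
record = {"name": "sensor-7", "tags": ["a", "b"], "readings": [1, 2.5, 3]}


def msgpack_roundtrip(obj):
    """Serialize with msgpack and read it back as Python str/list/dict."""
    packed = msgpack.packb(obj, use_bin_type=True)
    return msgpack.unpackb(packed, raw=False)


def pickle_roundtrip(obj):
    """Serialize with pickle; the bytes are only meaningful to Python."""
    return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


if __name__ == "__main__":
    print("pickle :", pickle_roundtrip(record))
    if msgpack is not None:
        print("msgpack:", msgpack_roundtrip(record))
    else:
        print("msgpack is not installed; skipping that half")
```

The msgpack bytes could be decoded by a client in another language, whereas the pickle bytes cannot. Note that msgpack only knows basic types, so tuples come back as lists and arbitrary Python objects need custom handling.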

As @Jeff noted above, this blog post may be of interest.

Lanthorn answered 4/6, 2015 at 19:26 Comment(6)
Am I right to say that 4D panels are not supported by pickle either? – Haletta
@firefly: in pandas 0.18 Panel4D is pickleable. But consider xarray instead of Panel4D. – Static
to_msgpack() crashes for larger data; I am not sure about to_pickle(). – Hey
What do you mean by "crashes for bigger size data"? Can you expand on that? – Prolongation
As of 2019, MsgPack support for Pandas has been deprecated, and the recommendation is to use pyarrow instead. – Kurtis
I made my own "msgpickle" library that can handle objects. – Bausch
