Why is dill much faster and more disk-efficient than pickle for NumPy arrays?

I'm using Python 2.7 and NumPy 1.11.2 with the latest version of dill (freshly installed via pip install dill) on Ubuntu 16.04.

When storing a NumPy array using pickle, I find that pickle is very slow, and stores arrays at almost three times the 'necessary' size.

For example, in the following code, pickle is approximately 50 times slower than dill (50 s versus 1 s) and creates a 2.2 GB file instead of an 800 MB one.

    import numpy
    import pickle
    import dill

    # 10000 x 10000 float64 values: about 800 MB of raw data
    B = numpy.random.rand(10000, 10000)

    with open('dill', 'wb') as fp:
        dill.dump(B, fp)
    with open('pickle', 'wb') as fp:
        pickle.dump(B, fp)

I thought dill was just a wrapper around pickle. If this is true, is there a way that I can improve the performance of pickle myself? Is it generally not advisable to use pickle for NumPy arrays?

EDIT: Using Python 3, I get the same performance for pickle and dill.

PS: I know about numpy.save, but I am working in a framework where I store lots of different objects, all residing in a dictionary, to a file.

Conchiferous answered 22/6, 2017 at 10:56 Comment(7)

Using Python 3.6 and NumPy 1.12.1 I get the same size for both files. Can you try upgrading NumPy? – Dhiman 22/6, 2017 at 11:6

@Dhiman Upgrading to 1.13.0 does not change anything. Using Python3 does – Conchiferous 22/6, 2017 at 12:28

OK, I don't know the specific difference but it looks likely that this is some kind of optimisation that is only valid in python 3 which is weird – Dhiman 22/6, 2017 at 12:29

Do you know of any argument against using dill for numpy arrays? Otherwise I'd just go with that – Conchiferous 22/6, 2017 at 12:31

Not really, I've never used dill but it seems to be an extension of pickle plus you can save a session state so it should just work fine. – Dhiman 22/6, 2017 at 12:45

Python 2 has a faster cPickle. That is the standard version in Python 3. – Yarborough 22/6, 2017 at 22:6

@Yarborough This doesn't do the trick. On my machine, using cPickle instead does not make any difference in runtime and memory consumption – Conchiferous 23/6, 2017 at 12:44

This ought to be a comment, but I do not have enough reputation... My guess is that this is due to the pickle protocol used.

On Python 2, the default protocol is 0 and highest supported protocol is 2. On Python 3, the default protocol is 3 and highest supported protocol is 4 (as of Python 3.6).

Each protocol version improves on the previous one, but protocol 0 is especially slow for largish objects. It should be avoided in most cases, except if you need to be able to read your pickles using extremely old versions of Python. Protocol 2 is already much better.

Now, I suppose dill uses pickle.HIGHEST_PROTOCOL by default, and if that is indeed the case, it would probably be the cause of a good deal of the speed difference. You could try using pickle.HIGHEST_PROTOCOL to see if you get similar performance using dill and standard pickle.

    with open('dill', 'wb') as fp:
        dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
    with open('pickle', 'wb') as fp:
        pickle.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
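
To check the protocol guess on your own setup without writing gigabytes to disk, a minimal sketch like the following compares the serialized size for every protocol your interpreter supports. The helper name pickled_sizes and the small array shape are my own illustrative choices, not something from the question, and the exact numbers will depend on your Python and NumPy versions.

```python
import pickle

import numpy as np


def pickled_sizes(obj):
    """Return {protocol: size in bytes} for every protocol this interpreter supports."""
    return {
        proto: len(pickle.dumps(obj, protocol=proto))
        for proto in range(pickle.HIGHEST_PROTOCOL + 1)
    }


# A much smaller array than in the question, so this runs quickly.
arr = np.random.rand(1000, 100)
sizes = pickled_sizes(arr)

for proto, size in sorted(sizes.items()):
    print("protocol %d: %d bytes (raw data: %d bytes)" % (proto, size, arr.nbytes))
```

If the guess is right, the size for the highest protocol should sit close to arr.nbytes, while protocol 0 should be noticeably larger.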

Consignee answered 20/9, 2017 at 10:15 Comment(0)

I'm the dill author. dill is an extension of pickle, but it does add some alternate pickling methods for numpy and other objects. For example, dill leverages the numpy methods for the pickling of arrays.

Additionally, I believe dill uses DEFAULT_PROTOCOL by default (not HIGHEST_PROTOCOL) on Python 3, while on Python 2 it uses HIGHEST_PROTOCOL by default.
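
Rather than relying on memory, the defaults can be inspected directly. This is a rough sketch: the variable names are illustrative, and reading the default from the dill.settings dictionary is an assumption about how the installed dill version exposes it, so treat the result as a hint rather than a guarantee.

```python
import pickle

# pickle.DEFAULT_PROTOCOL only exists on Python 3; Python 2's pickle defaults to 0.
pickle_default = getattr(pickle, "DEFAULT_PROTOCOL", 0)
pickle_highest = pickle.HIGHEST_PROTOCOL

try:
    import dill
    # Assumption: dill keeps its default protocol in the module-level settings dict.
    dill_default = dill.settings.get("protocol")
except ImportError:
    dill_default = None  # dill is not installed in this environment

print("pickle default: %s, highest: %s" % (pickle_default, pickle_highest))
print("dill default: %s" % (dill_default,))
```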

Rugged answered 20/9, 2017 at 14:54 Comment(0)

