more efficient way to pickle a string

Asked 30/3, 2009 at 1:48 Answered 11/5, 2010 at 7:26

Solved python numpy pickle space-efficiency

The pickle module seems to use string escape characters when pickling; this becomes inefficient, e.g. on numpy arrays. Consider the following:

import numpy, cPickle

z = numpy.zeros(1000, numpy.uint8)
len(z.dumps())
len(cPickle.dumps(z.dumps()))

The lengths are 1133 characters and 4249 characters respectively.

z.dumps() reveals something like "\x00\x00" (actual zeros in string), but pickle seems to be using the string's repr() function, yielding "'\x00\x00'" (zeros being ascii zeros).

i.e. (("0" in z.dumps()) == False) and (("0" in cPickle.dumps(z.dumps())) == True)

Longways answered 30/3, 2009 at 1:48 Comment(3)

You should add a specific question to your post here. – Sporophore 30/3, 2009 at 1:50

What do you want to serialize a Python string or a numpy array of bytes? – Leavis 30/3, 2009 at 2:19

should be len(cPickle.dumps(z)) – Steato 30/3, 2009 at 11:1

Try using a later version of the pickle protocol, via the protocol parameter to pickle.dumps(). The default is 0, which is an ASCII text format. Protocols 1 and 2 (and 3, but that's for py3k) are binary and should be more space-efficient; I suggest you use pickle.HIGHEST_PROTOCOL.
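To see the effect of the protocol choice, here is a minimal sketch. It assumes Python 3, where cPickle has been folded into pickle, and it uses a plain bytes payload rather than a numpy array so that it runs with the standard library alone. The payload and variable names are illustrative, not taken from the question.

```python
import pickle

# Illustrative payload: 1000 zero bytes, standing in for the array data.
payload = b"\x00" * 1000

# Protocol 0 is the text format, so non-printable bytes get escaped.
text_pickle = pickle.dumps(payload, protocol=0)

# The highest available protocol is binary and stores the bytes as they are.
binary_pickle = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# Compare the two sizes; the binary form should be much smaller.
print(len(text_pickle), len(binary_pickle))

# Both forms round-trip to the same object.
assert pickle.loads(text_pickle) == payload
assert pickle.loads(binary_pickle) == payload
```

The exact sizes depend on the Python version, so none are quoted here; the point is only that the text protocol pays for escaping and the binary ones do not.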

Collide answered 30/3, 2009 at 2:40 Comment(1)

Python 3 uses protocol 3 by default. – Teazel 13/11, 2014 at 9:13

Solution:

import zlib, cPickle

def zdumps(obj):
    return zlib.compress(cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL), 9)

def zloads(zstr):
    return cPickle.loads(zlib.decompress(zstr))

>>> len(zdumps(z))
128
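For readers on Python 3, the same idea might look like the sketch below. It assumes the pickle module in place of cPickle and works on bytes instead of str; the sample list is illustrative and was chosen only because it is highly repetitive.

```python
import pickle
import zlib

def zdumps(obj):
    """Pickle obj with the highest protocol, then zlib-compress the result."""
    return zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL), 9)

def zloads(zbytes):
    """Inverse of zdumps: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(zbytes))

# Illustrative, highly repetitive data (not the numpy array from the question).
data = [0] * 1000
plain = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)
packed = zdumps(data)

# Compare uncompressed and compressed sizes.
print(len(plain), len(packed))
assert zloads(packed) == data
```

How much compression helps depends entirely on the data; repetitive input like this shrinks a lot, random input barely at all.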

Steato answered 30/3, 2009 at 10:38 Comment(3)

Here's something more on the subject: tinyurl.com/3ymhaj5 . Basically, if you're serializing to disk you can just do gzip.open() instead of open. – Tecumseh 15/11, 2010 at 10:40

@slack3r that link is dead. – Spoliation 5/3, 2013 at 17:53

'ascii' codec can't encode character u'\xda' in position 1: ordinal not in range(128) – Endermic 27/10, 2013 at 13:54

z.dumps() is already a pickled string, i.e., it can be unpickled using pickle.loads():

>>> import numpy, pickle
>>> z = numpy.zeros(1000, numpy.uint8)
>>> s = z.dumps()
>>> a = pickle.loads(s)
>>> all(a == z)
True

Leavis answered 11/5, 2010 at 7:26 Comment(0)

An improvement to vartec's answer that seems a bit more memory-efficient (since it doesn't force everything into a string):

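import cPickle, gzip

def pickle(fname, obj):
    # Stream the pickle straight into the gzip file; leaving the with
    # block closes the file, which flushes the compressed data to disk.
    # (GzipFile supports the with statement from Python 2.7 on.)
    with gzip.open(fname, "wb", compresslevel=3) as f:
        cPickle.dump(obj, f, protocol=2)

def unpickle(fname):
    with gzip.open(fname, "rb") as f:
        return cPickle.load(f)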
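A Python 3 take on the same file-based approach could look like the following sketch. It assumes the pickle module instead of cPickle. The function names save_pickle and load_pickle are hypothetical, chosen to avoid shadowing the pickle module, and the temporary file path and sample dict are only there to make the example self-contained.

```python
import gzip
import os
import pickle
import tempfile

def save_pickle(fname, obj):
    # Write the pickle directly into a gzip stream, so the full pickled
    # string never has to be held in memory before compression.
    with gzip.open(fname, "wb", compresslevel=3) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(fname):
    # No compression level is needed here; gzip reads it from the file.
    with gzip.open(fname, "rb") as f:
        return pickle.load(f)

# Round-trip an illustrative object through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl.gz")
    original = {"zeros": bytes(1000)}
    save_pickle(path, original)
    restored = load_pickle(path)

assert restored == original
```

Using pickle.HIGHEST_PROTOCOL follows the suggestion in the comments; a fixed protocol number is the safer choice if older interpreters need to read the file.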

Longways answered 9/5, 2010 at 21:46 Comment(2)

-1 (1) Don't hard-code protocol numbers, use -1 or HIGHEST_PROTOCOL. (2) Subsequent compression is an ADD-ON and is irrelevant to his question. (3) Specifying compresslevel when decompressing is pointless; any such information that may be necessary to decompress the file would be stored in the header of the compressed file -- otherwise how would you be able to decompress a file if you didn't know what compression level was used? – Mckinnie 9/5, 2010 at 22:9

(1) Then py2 code won't read py3 objects. (2) the header says "an improvement to vartec's answer", which was using compression -- I think it used less mem, but it could have been a false impression... (3) fixed – Longways 11/5, 2010 at 7:15
