More efficient way to pickle a string

The pickle module seems to use string escape sequences when pickling, which becomes inefficient on, e.g., numpy arrays. Consider the following:

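import numpy, cPickle

z = numpy.zeros(1000, numpy.uint8)
len(z.dumps())                 # length of the pickled array
len(cPickle.dumps(z.dumps()))  # length after pickling that string again (default protocol 0)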

The lengths are 1133 characters and 4249 characters respectively.

z.dumps() returns a string containing raw zero bytes ("\x00\x00..."), but pickle seems to be using the string's repr(), so each zero byte is written out as the four ASCII characters \x00 (backslash, "x", "0", "0").

i.e. ("0" in z.dumps()) is False, while ("0" in cPickle.dumps(z.dumps())) is True.

Longways answered 30/3, 2009 at 1:48 Comment(3)
You should add a specific question to your post here. - Sporophore
What do you want to serialize: a Python string or a numpy array of bytes? - Leavis
It should be len(cPickle.dumps(z)). - Steato

Try a later version of the pickle protocol via the protocol parameter of pickle.dumps(). The default is 0, which is an ASCII text format. Protocols 1 and 2 (and 3, but that's for py3k) are binary and should be more space-efficient. I suggest you use pickle.HIGHEST_PROTOCOL.
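
A rough sketch of the size difference, assuming Python 3's pickle module (the question uses Python 2's cPickle) and a plain bytes object of 1000 zeros standing in for the numpy data. The names payload, text_size and binary_size are only for illustration:

```python
import pickle

# Assumption: 1000 zero bytes stand in for the array data from the question.
payload = bytes(1000)

# Protocol 0 is the ASCII text format; it has to escape every zero byte.
text_size = len(pickle.dumps(payload, protocol=0))

# The highest available protocol is binary and stores the bytes as they are.
binary_size = len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))

print(text_size, binary_size)
```

The binary result should come out only slightly larger than the payload itself, while the protocol 0 result is several times larger.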

Collide answered 30/3, 2009 at 2:40 Comment(1)
Python 3 uses protocol 3 by default. - Teazel

Solution:

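import zlib, cPickle

def zdumps(obj):
    # Pickle with the newest (binary) protocol, then compress at the maximum zlib level.
    return zlib.compress(cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL), 9)

def zloads(zstr):
    # Reverse of zdumps: decompress first, then unpickle.
    return cPickle.loads(zlib.decompress(zstr))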

>>> len(zdumps(z))
128
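
For anyone on Python 3, where cPickle no longer exists, the same idea might look roughly like this. It is a sketch under that assumption; the sample payload (1000 zero bytes in place of the numpy array) is made up for illustration:

```python
import pickle
import zlib

def zdumps(obj):
    """Pickle obj with the newest protocol, then compress at the maximum zlib level."""
    return zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL), 9)

def zloads(blob):
    """Reverse of zdumps: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))

# Assumption: a bytes object of 1000 zeros stands in for the array.
payload = bytes(1000)
packed = zdumps(payload)
print(len(packed))
```

Highly repetitive data such as a block of zeros compresses very well; real data will usually shrink much less.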
Steato answered 30/3, 2009 at 10:38 Comment(3)
Here's something more on the subject: tinyurl.com/3ymhaj5. Basically, if you're serializing to disk you can just use gzip.open() instead of open(). - Tecumseh
@slack3r that link is dead. - Spoliation
'ascii' codec can't encode character u'\xda' in position 1: ordinal not in range(128) - Endermic

z.dumps() already returns a pickled string, i.e., it can be unpickled using pickle.loads():

>>> import numpy, pickle
>>> z = numpy.zeros(1000, numpy.uint8)
>>> s = z.dumps()
>>> a = pickle.loads(s)
>>> all(a == z)
True
Leavis answered 11/5, 2010 at 7:26 Comment(0)

An improvement to vartec's answer that seems a bit more memory-efficient (since it doesn't force everything into a string):

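import cPickle, gzip

def pickle(fname, obj):
    # Write the pickle straight into a gzip file; "with" closes it so all data is flushed.
    # (GzipFile supports "with" from Python 2.7 onwards.)
    with gzip.open(fname, "wb", compresslevel=3) as f:
        cPickle.dump(obj, f, protocol=2)

def unpickle(fname):
    # The compression level is not needed when reading.
    with gzip.open(fname, "rb") as f:
        return cPickle.load(f)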
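
A Python 3 take on the same approach could look something like the sketch below. The function names save_compressed and load_compressed, the file name and the sample data are all made up for illustration, and it uses HIGHEST_PROTOCOL rather than a fixed protocol number:

```python
import gzip
import os
import pickle
import tempfile

def save_compressed(fname, obj):
    # Hypothetical helper: pickle obj directly into a gzip-compressed file.
    with gzip.open(fname, "wb", compresslevel=3) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_compressed(fname):
    # Hypothetical helper: read the object back; no compression level is needed here.
    with gzip.open(fname, "rb") as f:
        return pickle.load(f)

# Round trip through a temporary directory with made-up sample data.
data = {"zeros": bytes(1000)}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl.gz")
    save_compressed(path, data)
    restored = load_compressed(path)
```

Passing the gzip file object to pickle.dump avoids holding a separate compressed copy of the whole pickle in memory before it is written out.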
Longways answered 9/5, 2010 at 21:46 Comment(2)
-1. (1) Don't hard-code protocol numbers; use -1 or HIGHEST_PROTOCOL. (2) Subsequent compression is an add-on and is irrelevant to his question. (3) Specifying compresslevel when decompressing is pointless: any information needed to decompress the file is stored in the header of the compressed file. Otherwise, how could you decompress a file without knowing which compression level was used? - Mckinnie
(1) Then py2 code won't read py3 objects. (2) The header says "an improvement to vartec's answer", which was using compression. I think it used less memory, but that could have been a false impression... (3) Fixed. - Longways
