Python: Pickling a dict with some unpicklable items
C

5

12

I have an object gui_project which has an attribute .namespace, which is a namespace dict. (i.e. a dict from strings to objects.)

(This is used in an IDE-like program to let the user define his own object in a Python shell.)

I want to pickle this gui_project, along with the namespace. Problem is, some objects in the namespace (i.e. values of the .namespace dict) are not picklable objects. For example, some of them refer to wxPython widgets.

I'd like to filter out the unpicklable objects, that is, exclude them from the pickled version.

How can I do this?

(One thing I tried was to go through the values one by one and try to pickle each of them, but that ran into infinite recursion, and I need to be safe from that.)

(I do implement a GuiProject.__getstate__ method right now, to get rid of other unpicklable stuff besides namespace.)

Cuddle answered 2/11, 2010 at 18:4 Comment(0)
C
1

I ended up coding my own solution to this, using Shane Hathaway's approach.

Here's the code. (Look for CutePickler and CuteUnpickler.) Here are the tests. It's part of GarlicSim, so you can use it by installing garlicsim and doing from garlicsim.general_misc import pickle_tools.

If you want to use it on Python 3 code, use the Python 3 fork of garlicsim.

Cuddle answered 24/12, 2010 at 11:51 Comment(1)
Perhaps you should've made that pickle part into a separate module for easier reusability (although, it seems, the pickle_tools do actually use quite a bit from the general_misc). Also, it still fails on (some) function objects.Toms
B
7

I would use the pickler's documented support for persistent object references. Persistent object references are objects that are referenced by the pickle but not stored in the pickle.

http://docs.python.org/library/pickle.html#pickling-and-unpickling-external-objects

ZODB has used this API for years, so it's very stable. When unpickling, you can replace the object references with anything you like. In your case, you would want to replace the object references with markers indicating that the objects could not be pickled.

You could start with something like this (untested):

import cPickle
import wx

def persistent_id(obj):
    # wx.Object is the common base class of wxPython objects
    if isinstance(obj, wx.Object):
        return "filtered:wxObject"
    else:
        return None

class FilteredObject:
    def __init__(self, about):
        self.about = about
    def __repr__(self):
        return 'FilteredObject(%s)' % repr(self.about)

def persistent_load(obj_id):
    if obj_id.startswith('filtered:'):
        return FilteredObject(obj_id[9:])
    else:
        raise cPickle.UnpicklingError('Invalid persistent id')

def dump_filtered(obj, file):
    p = cPickle.Pickler(file)
    p.persistent_id = persistent_id
    p.dump(obj)

def load_filtered(file):
    u = cPickle.Unpickler(file)
    u.persistent_load = persistent_load
    return u.load()

Then just call dump_filtered() and load_filtered() instead of pickle.dump() and pickle.load(). wxPython objects will be pickled as persistent IDs, to be replaced with FilteredObjects at unpickling time.
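On Python 3, where cPickle no longer exists, the same persistent-ID idea can be written by subclassing pickle.Pickler and pickle.Unpickler and overriding persistent_id() and persistent_load(). The following is a minimal sketch under that assumption, not the code from this answer: Widget is a hypothetical stand-in for wx.Object, and FilteringPickler, FilteringUnpickler, dumps_filtered and loads_filtered are illustrative names.

```python
import io
import pickle


class Widget:
    """Hypothetical stand-in for an unpicklable GUI object (e.g. a wx widget)."""


class FilteredObject:
    """Marker that replaces a filtered object after unpickling."""

    def __init__(self, about):
        self.about = about

    def __repr__(self):
        return 'FilteredObject(%r)' % (self.about,)


class FilteringPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # A non-None return value is stored in place of the object itself.
        if isinstance(obj, Widget):
            return 'filtered:' + type(obj).__name__
        return None  # everything else is pickled normally


class FilteringUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Called for every persistent ID found in the stream.
        if isinstance(pid, str) and pid.startswith('filtered:'):
            return FilteredObject(pid[len('filtered:'):])
        raise pickle.UnpicklingError('Invalid persistent id: %r' % (pid,))


def dumps_filtered(obj):
    buf = io.BytesIO()
    FilteringPickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    return buf.getvalue()


def loads_filtered(data):
    return FilteringUnpickler(io.BytesIO(data)).load()


namespace = {'a': 1, 'w': Widget()}
restored = loads_filtered(dumps_filtered(namespace))
```

Because persistent_id() is consulted before an object is pickled, the Widget instance never reaches the normal pickling machinery, and restored['w'] comes back as a FilteredObject marker.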

You could make the solution more generic by filtering out objects that are not of the built-in types and have no __getstate__ method.

Update (15 Nov 2010): Here is a way to achieve the same thing with wrapper classes. Using wrapper classes instead of subclasses, it's possible to stay within the documented API.

from cPickle import Pickler, Unpickler, UnpicklingError


class FilteredObject:
    def __init__(self, about):
        self.about = about
    def __repr__(self):
        return 'FilteredObject(%s)' % repr(self.about)


class MyPickler(object):

    def __init__(self, file, protocol=0):
        pickler = Pickler(file, protocol)
        pickler.persistent_id = self.persistent_id
        self.dump = pickler.dump
        self.clear_memo = pickler.clear_memo

    def persistent_id(self, obj):
        if not hasattr(obj, '__getstate__') and not isinstance(obj,
            (basestring, int, long, float, tuple, list, set, dict)):
            return "filtered:%s" % type(obj)
        else:
            return None


class MyUnpickler(object):

    def __init__(self, file):
        unpickler = Unpickler(file)
        unpickler.persistent_load = self.persistent_load
        self.load = unpickler.load
        self.noload = unpickler.noload

    def persistent_load(self, obj_id):
        if obj_id.startswith('filtered:'):
            return FilteredObject(obj_id[9:])
        else:
            raise UnpicklingError('Invalid persistent id')


if __name__ == '__main__':
    from cStringIO import StringIO

    class UnpickleableThing(object):
        pass

    f = StringIO()
    p = MyPickler(f)
    p.dump({'a': 1, 'b': UnpickleableThing()})

    f.seek(0)
    u = MyUnpickler(f)
    obj = u.load()
    print obj

    assert obj['a'] == 1
    assert isinstance(obj['b'], FilteredObject)
    assert obj['b'].about
Bilicki answered 10/11, 2010 at 22:27 Comment(7)
Would it be possible to use this solution, but instead of using custom dump and load functions, to use a custom Pickler class? If so, will I also need to subclass Unpickler? How will they know to work together, for example what if someone would try to use the stock loads to unpickle something pickled by my Pickler subclass?Cuddle
You can always subclass picklers and unpicklers, but then you are in undocumented territory. About working together: if you try to unpickle something that contains persistent IDs using an unpickler that does not have a persistent_load function, you will get an exception.Bilicki
I added an example that uses wrapper classes instead of standalone functions or subclasses.Bilicki
Is it possible, inside the GuiProject.__getstate__ function, to find out which Pickler subclass is pickling us, in order to assert that it's our special pickler?Cuddle
Also, why did you encapsulate Pickler and Unpickler instead of subclassing them? Is there some kind of problem with subclassing them?Cuddle
You can't subclass cPickle.Pickler and cPickle.Unpickler; you'll get some kind of type error or the like. I always use cPickle rather than the pure-Python pickle module since cPickle is literally about 1000X faster.Bilicki
Regarding GuiProject.__getstate__ finding out which Pickler is pickling us: maybe you could include something in the __getstate__ return value that your custom pickler can pickle but which the standard pickler would choke on.Bilicki
M
1

This is how I would do this (I did something similar before and it worked):

  1. Write a function that determines whether or not an object is pickleable
  2. Make a list of all the pickleable variables, based on the above function
  3. Make a new dictionary (called D) that stores all the non-pickleable variables
  4. For each variable in D (this only works if the variables in D are very similar to one another), make a list of strings, where each string is legal Python code, such that executing the strings in order recreates the desired variable

Now, when you unpickle, you get back all the variables that were originally pickleable. For all variables that were not pickleable, you now have a list of strings (legal python code) that when executed in order, gives you the desired variable.
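Steps 1 to 3 can be sketched roughly as follows (Python 3). This is only an illustration: is_picklable and split_namespace are made-up names, the test is a plain trial pickle, and step 4 (the code strings that rebuild the rejected values) is left out because it depends entirely on what those values are.

```python
import pickle


def is_picklable(obj):
    """Step 1: return True if a trial pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
    except Exception:
        # Covers PicklingError, TypeError, AttributeError and RecursionError.
        return False
    return True


def split_namespace(namespace):
    """Steps 2 and 3: separate picklable entries from the rest."""
    picklable, rejected = {}, {}
    for name, value in namespace.items():
        target = picklable if is_picklable(value) else rejected
        target[name] = value
    return picklable, rejected


# A generator stands in for an unpicklable value here.
namespace = {'n': 42, 'words': ['a', 'b'], 'gen': (i for i in range(3))}
good, bad = split_namespace(namespace)
```

Note that the trial pickle serializes each value once just to test it, so this can be slow for large objects, and a value is rejected as a whole even if only a small part of it is unpicklable.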

Hope this helps

Mathilda answered 11/11, 2010 at 23:35 Comment(0)
Q
0

One approach would be to inherit from pickle.Pickler, and override the save_dict() method. Copy it from the base class, which reads like this:

def save_dict(self, obj):
    write = self.write

    if self.bin:
        write(EMPTY_DICT)
    else:   # proto 0 -- can't use EMPTY_DICT
        write(MARK + DICT)

    self.memoize(obj)
    self._batch_setitems(obj.iteritems())

However, in the call to _batch_setitems, pass an iterator that filters out all items that you don't want to be dumped, e.g.:

def save_dict(self, obj):
    write = self.write

    if self.bin:
        write(EMPTY_DICT)
    else:   # proto 0 -- can't use EMPTY_DICT
        write(MARK + DICT)

    self.memoize(obj)
    self._batch_setitems(item for item in obj.iteritems() 
                         if not isinstance(item[1], bad_type))

As save_dict isn't an official API, you need to check for each new Python version whether this override is still correct.

Quash answered 2/11, 2010 at 18:35 Comment(1)
Hm, is there a more portable solution? Asides from the fact that save_dict isn't an official API (and I'd have to verify it not only for different versions but different implementations), I would not want to require people who want to pickle gui_project to use a custom pickler like this. If I won't have a better choice I'll take this solution.Cuddle
V
0

The filtering part is indeed tricky. Using simple tricks, you can easily get the pickle to work. However, you might end up filtering out too much and losing information that you could have kept if the filter looked a little deeper. But the sheer variety of things that can end up in the .namespace makes building a good filter difficult.

However, we could leverage pieces that are already part of Python, such as deepcopy in the copy module.

I made a copy of the stock copy module, and did the following things:

  1. create a new type named LostObject to represent an object that will be lost in pickling.
  2. change _deepcopy_atomic to make sure x is picklable. If it's not, return an instance of LostObject.
  3. objects can define the methods __reduce__ and/or __reduce_ex__ to provide hints about whether and how to pickle them. We make sure that an exception thrown by these methods, which signals that the object cannot be pickled, is caught and the object is replaced by a LostObject.
  4. to avoid making an unnecessary copy of a big object (as an actual deepcopy would), we recursively check whether an object is picklable, and only copy the unpicklable parts. For instance, for a tuple of a picklable list and an unpicklable object, we will make a copy of the tuple - just the container - but not of its member list.

The following is the diff:

[~/Development/scratch/] $ diff -uN  /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/copy.py mcopy.py
--- /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/copy.py  2010-01-09 00:18:38.000000000 -0800
+++ mcopy.py    2010-11-10 08:50:26.000000000 -0800
@@ -157,6 +157,13 @@

     cls = type(x)

+    # if x is picklable, there is no need to make a new copy, just ref it
+    try:
+        dumps(x)
+        return x
+    except TypeError:
+        pass
+
     copier = _deepcopy_dispatch.get(cls)
     if copier:
         y = copier(x, memo)
@@ -179,10 +186,18 @@
                     reductor = getattr(x, "__reduce_ex__", None)
                     if reductor:
                         rv = reductor(2)
+                        try:
+                            x.__reduce_ex__()
+                        except TypeError:
+                            rv = LostObject, tuple()
                     else:
                         reductor = getattr(x, "__reduce__", None)
                         if reductor:
                             rv = reductor()
+                            try:
+                                x.__reduce__()
+                            except TypeError:
+                                rv = LostObject, tuple()
                         else:
                             raise Error(
                                 "un(deep)copyable object of type %s" % cls)
@@ -194,7 +209,12 @@

 _deepcopy_dispatch = d = {}

+from pickle import dumps
+class LostObject(object): pass
 def _deepcopy_atomic(x, memo):
+    try:
+        dumps(x)
+    except TypeError: return LostObject()
     return x
 d[type(None)] = _deepcopy_atomic
 d[type(Ellipsis)] = _deepcopy_atomic

Now back to the pickling part. You simply make a deepcopy using this new deepcopy function and then pickle the copy. The unpicklable parts have been removed during the copying process.

x = dict(a=1)
xx = dict(x=x)
x['xx'] = xx
x['f'] = file('/tmp/1', 'w')
class List():
    def __init__(self, *args, **kwargs):
        print 'making a copy of a list'
        self.data = list(*args, **kwargs)
x['large'] = List(range(1000))
# now x contains a loop and an unpicklable file object
# the following line will throw
from pickle import dumps, loads
try:
    dumps(x)
except TypeError:
    print 'yes, it throws'

def check_picklable(x):
    try:
        dumps(x)
    except TypeError:
        return False
    return True

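# use the LostObject class defined in mcopy, so that isinstance() below
# matches the class that the unpickled placeholder actually has
from mcopy import deepcopy, LostObject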

# though x has a big List object, this deepcopy will not make a new copy of it
c = deepcopy(x)
dumps(c)
cc = loads(dumps(c))
# check loop reference
if cc['xx']['x'] == cc:
    print 'yes, loop reference is preserved'
# check unpicklable part
if isinstance(cc['f'], LostObject):
    print 'unpicklable part is now an instance of LostObject'
# check large object
if loads(dumps(c))['large'].data[999] == x['large'].data[999]:
    print 'large object is ok'

Here is the output:

making a copy of a list
yes, it throws
yes, loop reference is preserved
unpicklable part is now an instance of LostObject
large object is ok

You see that 1) mutual pointers (between x and xx) are preserved and we do not run into an infinite loop; 2) the unpicklable file object is converted to a LostObject instance; and 3) no new copy of the large object is created, since it is picklable.
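For readers who would rather not carry a patched copy of the copy module, the same idea can be approximated with a small standalone function. The sketch below (Python 3) is a different, simplified technique and not the mcopy.py diff above: strip_unpicklable and is_picklable are hypothetical names, only dict, list and tuple containers are walked, dict keys are assumed to be picklable (as the string keys of a namespace are), and anything else that fails a trial pickle becomes a LostObject.

```python
import pickle


class LostObject:
    """Placeholder for a value that could not be pickled."""


def is_picklable(obj):
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def strip_unpicklable(obj, memo=None):
    """Return obj if it pickles; otherwise rebuild only its containers."""
    if memo is None:
        memo = {}
    if id(obj) in memo:              # seen before: reuse, so cycles terminate
        return memo[id(obj)]
    if is_picklable(obj):
        memo[id(obj)] = obj          # share the original, no copy is made
        return obj
    if isinstance(obj, dict):
        result = {}
        memo[id(obj)] = result       # register before recursing (cycles)
        for key, value in obj.items():
            result[key] = strip_unpicklable(value, memo)
        return result
    if isinstance(obj, list):
        result = []
        memo[id(obj)] = result
        result.extend(strip_unpicklable(item, memo) for item in obj)
        return result
    if isinstance(obj, tuple):
        # Tuples are immutable, so they can only be registered once built.
        result = tuple(strip_unpicklable(item, memo) for item in obj)
        memo[id(obj)] = result
        return result
    lost = LostObject()              # unpicklable leaf
    memo[id(obj)] = lost
    return lost


x = {'a': 1}
xx = {'x': x}
x['xx'] = xx                         # reference loop between x and xx
x['f'] = (i for i in range(3))       # a generator is not picklable
x['big'] = list(range(1000))
clean = strip_unpicklable(x)
```

The result can then be passed to pickle.dumps(), provided LostObject lives in an importable module. The repeated trial pickles make this quadratic in the worst case for deeply nested data, so it trades speed for simplicity.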

Vulcanite answered 10/11, 2010 at 0:43 Comment(2)
Will this involve actual deepcopying of the objects in .namespace? I mean, the user might have huge objects there, and I wouldn't want to duplicate them.Cuddle
I just made a change to make it more memory efficient. If an object is picklable, then no real copy will be made. If it is not picklable, the program recursively makes shallow copies for its picklable parts. For instance, a 2-tuple containing a picklable list and unpickable atomic object itself is not picklable. To copy it, the new deepcopy function will create a new 2-tuple whose first element will point to the same picklable list and whose second element will be a LostObject instance. No copy of the list is created. Now the 2-tuple is picklable.Vulcanite
