Serialize a python function with dependencies
I have tried multiple approaches to pickle a python function with dependencies, following many recommendations on StackOverflow (such as dill, cloudpickle, etc.), but all seem to run into a fundamental issue that I cannot figure out.

I have a main module that tries to pickle a function from an imported module, send it over ssh, and have it unpickled and executed on a remote machine.

So main has:

    import dill  # for example
    import modulea

    serial = dill.dumps(modulea.func)
    send(serial)  # pseudocode: transmit the bytes over ssh

On the remote machine:

    import dill

    serial = receive()  # pseudocode: read the bytes sent over ssh
    funcremote = dill.loads(serial)
    funcremote()

If the functions being pickled and sent are top-level functions defined in main itself, everything works. When they are defined in an imported module, the loads call fails with messages of the type "module modulea not found".

It appears that the module name is pickled along with the function name. I do not see any way to "fix up" the pickle to remove the dependency, or alternately, to create a dummy module in the receiver to become the recipient of the unpickling.
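The failure can be reproduced without any ssh setup. Below is a minimal sketch using the stdlib pickle (dill behaves the same way for a plain function defined in a module): the stream records only the reference `modulea.func`, so unpickling dies wherever the module cannot be imported. The throwaway in-process module here stands in for a real `modulea` that exists only on the sending machine.

```python
import pickle
import sys
import types

# Stand-in for modulea: a module that exists only in this process,
# the way modulea exists only on the sending machine.
mod = types.ModuleType("modulea")
exec("def func(): return 'remote result'", mod.__dict__)
sys.modules["modulea"] = mod

serial = pickle.dumps(mod.func)  # stores just the reference "modulea.func"

# Simulate the receiving machine, where modulea cannot be imported.
del sys.modules["modulea"]
err = None
try:
    pickle.loads(serial)
except ModuleNotFoundError as exc:
    err = exc
print(err)  # No module named 'modulea'
```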

Any pointers will be much appreciated.

--prasanna

Bonacci answered 15/10, 2014 at 18:53 Comment(3)
It is an easy thing to replace the module name of a function upon unpickling. During serialization, you could replace any __name__ with __main__, and voila… it should work… that is, unless the function has any dependencies in the enclosing module. Then it will fail.Bramante
The problem is that the dill.loads fails -- it never unpickles. As you point out correctly in your post below, since modulea in the example above is not available, the loads dies. So renaming it after the fact doesn't help.Bonacci
I'm not talking about renaming it after the fact, I'm talking about replacing the attribute at load time with a custom pickler. That would work as detailed above.Bramante
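The load-time fix described in the comment above can be sketched with a custom unpickler (shown here with the stdlib pickle; dill's Unpickler can be extended the same way). This assumes the receiver already has a compatible definition of the function available in its own __main__; the unpickler simply redirects the lookup when the pickled module cannot be imported:

```python
import io
import pickle
import sys
import types

class ModuleFallbackUnpickler(pickle.Unpickler):
    """If the pickled module can't be imported, look the name up in __main__."""
    def find_class(self, module, name):
        try:
            return super().find_class(module, name)
        except ImportError:
            import __main__
            return getattr(__main__, name)

# --- demo: sender side, with a module the receiver won't have ---
mod = types.ModuleType("modulea")
exec("def func(): return 42", mod.__dict__)
sys.modules["modulea"] = mod
serial = pickle.dumps(mod.func)

# --- receiver side: modulea is gone, but __main__ has its own copy ---
del sys.modules["modulea"]
import __main__
__main__.func = mod.func  # the receiver's compatible definition

funcremote = ModuleFallbackUnpickler(io.BytesIO(serial)).load()
print(funcremote())  # 42
```

Note this only works because the receiver supplies a definition for the name; it does not conjure up the missing module's code.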
B
20

I'm the dill author. I do this exact thing over ssh, and with success. Currently, dill and all of the other serializers pickle modules by reference… so to successfully pass a function defined in a file, you have to ensure that the relevant module is also installed on the other machine. I do not believe there is any object serializer that serializes modules directly (i.e. not by reference).
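The by-reference behavior is easy to see with the stdlib pickle (dill does the same for an ordinary function defined in a module): the stream carries only the module and attribute names, and the receiver re-imports the module and looks the name up.

```python
import math
import pickle

payload = pickle.dumps(math.sqrt)

# Only the names travel in the stream, not the function's code.
assert b"math" in payload and b"sqrt" in payload

# Unpickling re-imports math and does the attribute lookup.
print(pickle.loads(payload)(9.0))  # 3.0
```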

Having said that, dill does have some options to serialize object dependencies. For example, for class instances, the default in dill is to not serialize class instances by reference… so the class definition can also be serialized and sent with the instance. In dill, you can also (using a very new feature) serialize file handles by serializing the file itself, instead of doing so by reference. But again, if you have a function defined in a module, you are out of luck, as modules are serialized by reference pretty darn universally.

You might be able to use dill to do this, however: not by pickling the object, but by extracting the source and sending the source code. In pathos.pp and pyina, dill is used to extract the source and the dependencies of any object (including functions) and pass them to another computer/process/etc. Since this is not an easy thing to do, dill can also fall back to extracting a relevant import statement and sending that instead of the source code.
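The send-the-source approach can be sketched without dill at all: ship the function's source text, plus the source of anything it depends on, and exec it into a namespace on the remote side. Here the hand-written source string stands in for what dill's source-extraction machinery would produce:

```python
# What the sender would extract (e.g. with dill.source) and transmit:
SRC = '''
def _helper(x):
    return x * 2

def work(x):
    return _helper(x) + 1
'''

# Remote side: rebuild the function, dependencies included, in a fresh
# namespace. work's globals are remote_ns, so it finds _helper there.
remote_ns = {}
exec(SRC, remote_ns)
print(remote_ns["work"](3))  # 7
```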

You can understand, hopefully, that this is a messy, messy thing to do (as noted in one of the dependencies of the function I am extracting below). However, what you are asking is successfully done in the pathos package to pass code and dependencies to different machines across ssh-tunneled ports.

>>> import dill
>>> 
>>> print dill.source.importable(dill.source.importable)
from dill.source import importable
>>> print dill.source.importable(dill.source.importable, source=True)
def _closuredsource(func, alias=''):
    """get source code for closured objects; return a dict of 'name'
    and 'code blocks'"""
    #FIXME: this entire function is a messy messy HACK
    #      - pollutes global namespace
    #      - fails if name of freevars are reused
    #      - can unnecessarily duplicate function code
    from dill.detect import freevars
    free_vars = freevars(func)
    func_vars = {}
    # split into 'funcs' and 'non-funcs'
    for name,obj in list(free_vars.items()):
        if not isfunction(obj):
            # get source for 'non-funcs'
            free_vars[name] = getsource(obj, force=True, alias=name)
            continue
        # get source for 'funcs'

#…snip… …snip… …snip… …snip… …snip… 

            # get source code of objects referred to by obj in global scope
            from dill.detect import globalvars
            obj = globalvars(obj) #XXX: don't worry about alias?
            obj = list(getsource(_obj,name,force=True) for (name,_obj) in obj.items())
            obj = '\n'.join(obj) if obj else ''
            # combine all referred-to source (global then enclosing)
            if not obj: return src
            if not src: return obj
            return obj + src
        except:
            if tried_import: raise
            tried_source = True
            source = not source
    # should never get here
    return

I imagine something could also be built around the dill.detect.parents method, which provides a list of pointers to all parent objects for any given object… and one could reconstruct all of any function's dependencies as objects… but this is not implemented.

BTW: to establish an ssh tunnel, just do this:

>>> t = pathos.Tunnel.Tunnel()
>>> t.connect('login.university.edu')
39322
>>> t  
Tunnel('-q -N -L39322:login.university.edu:45075 login.university.edu')

Then you can work across the local port with ZMQ, or ssh, or whatever. If you want to do so with ssh, pathos also has that built in.

Bramante answered 15/10, 2014 at 22:25 Comment(3)
Yes, I have been playing with this since yesterday. Trying something simpler -- my module is fairly clean in that each function I am interested in only uses one or two of a small handful of utility functions in the module. So, I can dill/pickle and send over each of the utility functions first, and then just send the function that I want executed in that context.Bonacci
BTW I will look into the pathos tunnels. I am using execnet at this time (codespeak.net/execnet/#)Bonacci
If your module consists of only one file, you could pickle the module with dill.source.getsource, and then pickle the function as an object and send it afterward. Or, as I mention in my comment on your question above, you could extend dill.Pickler and dill.Unpickler to check the __module__ attribute for any function, and if the given module is not available, then set __module__ = '__main__' and it should work as long as there are no missing dependencies.Bramante
