What are the pitfalls of using Dill to serialise scikit-learn/statsmodels models?

I need to serialise scikit-learn/statsmodels models such that all the dependencies (code + data) are packaged in an artefact, and this artefact can be used to initialise the model and make predictions. Using the pickle module is not an option because it only takes care of the data dependency (the code is not packaged). So I have been conducting experiments with Dill. To make my question more precise, the following is an example where I build a model and persist it.

from sklearn import datasets
from sklearn import svm
from sklearn.preprocessing import Normalizer
import dill

digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]

class Model:
    def __init__(self):
        self.normalizer = Normalizer()
        self.clf = svm.SVC(gamma=0.001, C=100.)
    def train(self, training_data_X, training_data_Y):
        # fit the normaliser on the training data before fitting the classifier
        normalised_training_data_X = self.normalizer.fit_transform(training_data_X)
        self.clf.fit(normalised_training_data_X, training_data_Y)
    def predict(self, test_data_X):
        # reuse the already-fitted normaliser at prediction time
        return self.clf.predict(self.normalizer.transform(test_data_X))

model = Model()
model.train(training_data_X, training_data_Y)
print(model.predict(test_data_X))
with open("my_model.dill", 'wb') as model_file:  # pickles are binary
    dill.dump(model, model_file)

Correspondingly, here is how I initialise the persisted model (in a new session) and make a prediction. Note that this code does not explicitly initialise, or have any knowledge of, the class Model.

import dill
from sklearn import datasets

digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]

with open("my_model.dill") as model_file:
    model = dill.load(model_file)

print(model.predict(test_data_X))

Has anyone used Dill in this way? The idea is for a data scientist to extend a ModelWrapper class for each model they implement, and then to build the infrastructure around this that persists the models, deploys them as services and manages the entire lifecycle of the models.

import abc

import dill

class ModelWrapper(abc.ABC):
    def __init__(self, model):
        self.model = model
    @abc.abstractmethod
    def predict(self, input):
        return
    def dumps(self):
        # serialise the whole wrapper (code + fitted state) with dill
        return dill.dumps(self)
    def loads(self, model_string):
        # restore the wrapped model from a dill byte string
        self.model = dill.loads(model_string)
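
For illustration, here is a minimal sketch of how such a wrapper might be used end to end; the SklearnModelWrapper name is my own and not part of the original design, and the receiving side simply calls dill.loads on the stored byte string.

import dill

# Hypothetical concrete wrapper around the Model class from the question.
class SklearnModelWrapper(ModelWrapper):
    def predict(self, input):
        return self.model.predict(input)

wrapper = SklearnModelWrapper(model)   # wrap an already-trained model
payload = wrapper.dumps()              # bytes: wrapper code + fitted state
# ... store payload in a key-value store, ship it to another machine, etc. ...
restored = dill.loads(payload)         # the receiving side only needs dill installed
print(restored.predict(test_data_X))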

Other than the security implications (arbitrary code execution) and the requirement that modules like scikit-learn have to be installed on the machine that's serving the model, are there any other pitfalls in this approach? Any comments or words of advice would be most helpful.

I think that YHat and Dato have taken a similar approach, but rolled out their own implementations of Dill for similar purposes.

Thriller answered 24/9, 2015 at 9:17 Comment(1)
Ok. I have a working prototype of this and it seems to work fine. Now I need to do the same for R. Any pointers on this? (Thriller)

I'm the dill author. dill was built to do exactly what you are doing… (to persist numerical fits within class instances for statistics) where these objects can then be distributed to different resources and run in an embarrassingly parallel fashion. So, the answer is yes -- I have run code like yours, using mystic and/or sklearn.

Note that many of the authors of sklearn use cloudpickle for enabling parallel computing on sklearn objects, and not dill. dill can pickle more types of objects than cloudpickle; however, cloudpickle is slightly better (at the time of writing) at pickling objects that make references to the global dictionary as part of a closure -- by default, dill does this by reference, while cloudpickle physically stores the dependencies. However, dill has a "recurse" mode that acts like cloudpickle, so the difference when using this mode is minor. (To enable "recurse" mode, do dill.settings['recurse'] = True, or use recurse=True as a flag in dill.dump; see the sketch below.) Another minor difference is that cloudpickle contains special support for things like scikits.timeseries and PIL.Image, while dill does not.
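
For concreteness, a minimal sketch of both ways to enable "recurse" mode (the file name is only an example):

import dill

dill.settings['recurse'] = True            # global setting

# or per call, via the recurse flag
with open("my_model.dill", 'wb') as f:
    dill.dump(model, f, recurse=True)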

On the plus side, dill does not pickle classes by reference, so by pickling a class instance it serializes the class object itself -- which is a big advantage, as it serializes instances of derived classes of classifiers, models, etc. from sklearn in their exact state at the time of pickling… so if you make modifications to the class object, the instance still unpickles correctly. There are other advantages of dill over cloudpickle, aside from the broader range of objects (and typically a smaller pickle) -- however, I won't list them here. You asked for pitfalls, so differences are not pitfalls.
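
As an illustration of what pickling a class by value (rather than by reference) means in practice -- my own sketch, using dill's default settings -- the class definition travels inside the pickle, so the instance can be restored even where the class is not importable:

import dill

class Doubler:                     # defined in __main__, not in an installed module
    def transform(self, x):
        return 2 * x

payload = dill.dumps(Doubler())    # serialises the class definition along with the instance
del Doubler                        # simulate a process that does not know the class

restored = dill.loads(payload)     # dill rebuilds the class from the pickle
print(restored.transform(21))      # 42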

Major pitfalls:

  • You should have anything your classes refer to installed on the remote machine, just in case dill (or cloudpickle) pickles it by reference.

  • You should try to make your classes and class methods as self-contained as possible (e.g. don't refer to objects defined in the global scope from your classes).

  • sklearn objects can be big, so saving many of them to a single pickle is not always a good idea… you might want to use klepto, which has a dict interface to caching and archiving and lets you configure the archive interface to store each key-value pair individually (e.g. one entry per file); see the sketch after this list.
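
A minimal sketch of the klepto pattern mentioned above, assuming a dir_archive backend; the archive name and key are illustrative only.

from klepto.archives import dir_archive

# one entry per file under the 'models' directory
models = dir_archive('models', serialized=True, cached=True)
models['svm_digits'] = model       # the trained Model instance from the question
models.dump()                      # flush the in-memory cache to disk

# later / elsewhere: load only the entry you need
models = dir_archive('models', serialized=True, cached=True)
models.load('svm_digits')
restored = models['svm_digits']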

Yt answered 3/10, 2015 at 17:49 Comment(2)
Hello author, how do I make the dill dump result smaller? Is there any compression level option? (Compaction)
@CSQGB: you can choose one of the different serialization protocols... they will produce different-sized pickles. A higher protocol number tends to be smaller, I believe. In terms of compression, a library like klepto (which uses dill) can provide a compressed serialized object. (Yt)

To begin with: in your sample code, pickle could work fine. I use pickle all the time to package a model and use it later, unless you want to send the model directly to another server or save the interpreter state, because that is what Dill is good at and pickle cannot do. It also depends on your code and on which types you use; pickle might fail where Dill is more stable.

Dill is primarily based on pickle, so the two are very similar. Some things you should take into account / look into:

  1. Limitations of Dill

    The frame, generator and traceback standard types cannot be packaged.

  2. cloudpickle might be a good idea for your problem as well: it has better support for pickling objects than pickle (not per se better than Dill), and you can pickle code easily as well.

Once the target machine has the correct libraries loaded (be careful about differing Python versions as well, because they may break your code), everything should work fine with both Dill and cloudpickle, as long as you do not use the unsupported standard types.
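
As a quick sanity check before shipping an artefact, you can ask dill whether an object will round-trip at all; a minimal sketch using dill's diagnostic helpers (treat the exact output as illustrative):

import dill
import dill.detect

def will_ship(obj):
    # True if dill can pickle obj; otherwise print the members that fail
    if dill.pickles(obj):
        return True
    print(dill.detect.badobjects(obj, depth=1))
    return False

will_ship(model)   # e.g. the trained Model instance from the question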

Hope this helps.

Petaloid answered 29/9, 2015 at 7:59 Comment(2)
I need to run the model on another server. The idea is to store models in a key-value store and bring them up behind a service endpoint as and when required. (Thriller)
Then both Dill and cloudpickle should work fine for what you want. (Petaloid)

I package Gaussian process (GP) models from scikit-learn using pickle.

The primary reason is that the GP takes a long time to build and loads much faster from a pickle. So during initialisation my code checks whether the data files for the model have been updated and regenerates the model if necessary; otherwise it just deserialises it from the pickle!
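
A minimal sketch of that caching pattern, assuming the model is built from a single data file; the paths and the build_model function are hypothetical.

import os
import pickle

DATA_PATH = "training_data.csv"    # hypothetical input data
MODEL_PATH = "gp_model.pkl"        # hypothetical cached model

def load_or_build_model(build_model):
    # rebuild only if the data file is newer than the cached pickle
    if (os.path.exists(MODEL_PATH)
            and os.path.getmtime(MODEL_PATH) >= os.path.getmtime(DATA_PATH)):
        with open(MODEL_PATH, 'rb') as f:
            return pickle.load(f)
    model = build_model(DATA_PATH)             # the slow step, e.g. fitting the GP
    with open(MODEL_PATH, 'wb') as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    return model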

I would use pickle, dill and cloudpickle, in that order of preference.

Note that pickle accepts a protocol keyword argument, and some protocol values can speed up serialisation and reduce memory usage significantly! Finally, I wrap the pickling code with compression from the Python standard library if necessary.
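
For illustration, a minimal sketch combining the highest pickle protocol with gzip from the standard library; the file name is just an example.

import gzip
import pickle

# highest protocol: usually faster and more compact than the default
with gzip.open("gp_model.pkl.gz", 'wb') as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

with gzip.open("gp_model.pkl.gz", 'rb') as f:
    restored = pickle.load(f)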

Teledu answered 3/10, 2015 at 12:37 Comment(0)
