I need to serialise scikit-learn/statsmodels models such that all the dependencies (code + data) are packaged in an artefact and this artefact can be used to initialise the model and make predictions. Using the pickle module
is not an option because this will only take care of the data dependency (the code will not be packaged). So, I have been conducting experiments with Dill. To make my question more precise, the following is an example where I build a model and persist it.
from sklearn import datasets
from sklearn import svm
from sklearn.preprocessing import Normalizer
import dill
digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]
class Model:
def __init__(self):
self.normalizer = Normalizer()
self.clf = svm.SVC(gamma=0.001, C=100.)
def train(self, training_data_X, training_data_Y):
normalised_training_data_X = normalizer.fit_transform(training_data_X)
self.clf.fit(normalised_training_data_X, training_data_Y)
def predict(self, test_data_X):
return self.clf.predict(self.normalizer.fit_transform(test_data_X))
model = Model()
model.train(training_data_X, training_data_Y)
print model.predict(test_data_X)
dill.dump(model, open("my_model.dill", 'w'))
Corresponding to this, here is how I initialise the persisted model (in a new session) and make a prediction. Note that this code does not explicitly initialise or have knowledge of the class Model
.
import dill
from sklearn import datasets
digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]
with open("my_model.dill") as model_file:
model = dill.load(model_file)
print model.predict(test_data_X)
Has anyone used Dill isn this way?. The idea is for a data scientist to extend a ModelWrapper class
for each model they implement and then build the infrastructure around this that persists the models, deploy the models as services and manage the entire lifecycle of the model.
class ModelWrapper(object):
__metaclass__ = abc.ABCMeta
def __init__(self, model):
self.model = model
@abc.abstractmethod
def predict(self, input):
return
def dumps(self):
return dill.dumps(self)
def loads(self, model_string):
self.model = dill.loads(model_string)
Other than the security implications (arbitrary code execution) and the requirement that modules like scikit-learn
will have to be installed on the machine thats serving the model, are there and any other pitfalls in this approach? Any comments or words of advice would be most helpful.
I think that YHat and Dato have taken similar approach but rolled out there own implementations of Dill for similar purposes.