scikit-learn model persistence: pickle vs PMML vs ...?

I built a scikit-learn model and I want to reuse it in a daily Python cron job (NB: no other platforms are involved: no R, no Java, etc.).

I pickled it (actually, I pickled my own object whose one field is a GradientBoostingClassifier), and I un-pickle it in the cron job. So far so good (this has been discussed in "Save classifier to disk in scikit-learn" and "Model persistence in Scikit-Learn?").

However, I upgraded sklearn and now I get these warnings:

.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator PriorProbabilityEstimator from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315: 
UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)

What do I do now?

I can downgrade to 0.18.1 and stick with it until I am ready to rebuild the model. For various reasons I find this unacceptable.
I can un-pickle the file and re-pickle it. This worked with 0.18.2, but breaks with 0.19. NFG. joblib looks no better.
I wish I could save the data in a version-independent ASCII format (e.g., JSON or XML). This is, obviously, the optimal solution, but there seems to be NO way to do that (see also "Sklearn - model persistence without pkl file").
I could save the model to PMML, but its support is lukewarm at best: I can use sklearn2pmml to save the model (although not easily), and augustus/lightpmmlpredictor to apply (although not load) the model. However, none of those can be installed directly with pip, which makes deployment a nightmare. Also, the augustus and lightpmmlpredictor projects seem to be dead. "Importing PMML models into Python (Scikit-learn)" - nope.
A variant of the above: save PMML using sklearn2pmml, and use openscoring for scoring. Requires interfacing with an external process. Yuk.

Suggestions?

Model persistence across different versions of scikit-learn is generally impossible. The reason is obvious: you pickle Class1 with one definition, and want to unpickle it into Class2 with another definition.

You can:

Still try to stick to one version of sklearn.
Ignore the warnings and hope that what worked for Class1 will also work for Class2.
Write your own class that can serialize your GradientBoostingClassifier and restore it from this serialized form, and hope that it would work better than pickle.
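
If you go with the first option, it may help to make the cron job fail loudly on a version mismatch instead of only printing a warning. Below is a minimal sketch of that idea, not anything sklearn provides: the function names `dump_with_version` and `load_checked` are made up for illustration, and in real use you would pass `sklearn.__version__` as the version string. The model is pickled separately and wrapped, so the estimator itself is not un-pickled until the version check has passed.

```python
import pickle

def dump_with_version(obj, library_version):
    """Pickle obj and wrap the bytes together with the library version used to build it."""
    inner = pickle.dumps(obj)
    return pickle.dumps({'version': library_version, 'payload': inner})

def load_checked(blob, current_version):
    """Un-pickle the payload only if it was saved under the same library version."""
    record = pickle.loads(blob)  # the outer dict holds only a string and bytes
    if record['version'] != current_version:
        raise RuntimeError('model saved with version %s, but running %s'
                           % (record['version'], current_version))
    return pickle.loads(record['payload'])

# Toy usage with a plain list standing in for the model.
blob = dump_with_version([1, 2, 3], '0.18.1')
restored = load_checked(blob, '0.18.1')
```

This does not make the pickle portable across versions; it only turns a silent risk into an explicit error that tells you the model has to be rebuilt.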

I made an example of how you can convert a single DecisionTreeRegressor into a pure list-and-dict format, fully JSON-compatible, and restore it back.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_classification

### Code to serialize and deserialize trees

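# Per-node arrays stored on tree.tree_ (despite the name, they cover all nodes, not only leaves)
LEAF_ATTRIBUTES = ['children_left', 'children_right', 'threshold', 'value', 'feature', 'impurity', 'weighted_n_node_samples']
# Scalar attributes stored on the estimator itself
TREE_ATTRIBUTES = ['n_classes_', 'n_features_', 'n_outputs_']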

def serialize_tree(tree):
    """ Convert a sklearn.tree.DecisionTreeRegressor into a json-compatible format """
    encoded = {
        'nodes': {},
        'tree': {},
        'n_leaves': len(tree.tree_.threshold),
        'params': tree.get_params()
    }
    for attr in LEAF_ATTRIBUTES:
        encoded['nodes'][attr] = getattr(tree.tree_, attr).tolist()
    for attr in TREE_ATTRIBUTES:
        encoded['tree'][attr] = getattr(tree, attr)
    return encoded

def deserialize_tree(encoded):
    """ Restore a sklearn.tree.DecisionTreeRegressor from a json-compatible format """
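    # Fit a throwaway tree on dummy data only to allocate a tree_ structure
    # with enough nodes (encoded['n_leaves'] actually holds the node count);
    # its arrays are overwritten below.
    x = np.arange(encoded['n_leaves'])
    tree = DecisionTreeRegressor().fit(x.reshape((-1,1)), x)
    tree.set_params(**encoded['params'])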
    for attr in LEAF_ATTRIBUTES:
        for i in range(encoded['n_leaves']):
            getattr(tree.tree_, attr)[i] = encoded['nodes'][attr][i]
    for attr in TREE_ATTRIBUTES:
        setattr(tree, attr, encoded['tree'][attr])
    return tree

## test the code

X, y = make_classification(n_classes=3, n_informative=10)
tree = DecisionTreeRegressor().fit(X, y)
encoded = serialize_tree(tree)
decoded = deserialize_tree(encoded)
assert (decoded.predict(X)==tree.predict(X)).all()

Having this, you can go on to serialize and deserialize the whole GradientBoostingClassifier:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble.gradient_boosting import PriorProbabilityEstimator

def serialize_gbc(clf):
    encoded = {
        'classes_': clf.classes_.tolist(),
        'max_features_': clf.max_features_, 
        'n_classes_': clf.n_classes_,
        'n_features_': clf.n_features_,
        'train_score_': clf.train_score_.tolist(),
        'params': clf.get_params(),
        'estimators_shape': list(clf.estimators_.shape),
        'estimators': [],
        'priors':clf.init_.priors.tolist()
    }
    for tree in clf.estimators_.reshape((-1,)):
        encoded['estimators'].append(serialize_tree(tree))
    return encoded

def deserialize_gbc(encoded):
    x = np.array(encoded['classes_'])
    clf = GradientBoostingClassifier(**encoded['params']).fit(x.reshape(-1, 1), x)
    trees = [deserialize_tree(tree) for tree in encoded['estimators']]
    clf.estimators_ = np.array(trees).reshape(encoded['estimators_shape'])
    clf.init_ = PriorProbabilityEstimator()
    clf.init_.priors = np.array(encoded['priors'])
    clf.classes_ = np.array(encoded['classes_'])
    clf.train_score_ = np.array(encoded['train_score_'])
    clf.max_features_ = encoded['max_features_']
    clf.n_classes_ = encoded['n_classes_']
    clf.n_features_ = encoded['n_features_']
    return clf

# test on the same problem
clf = GradientBoostingClassifier()
clf.fit(X, y)
encoded = serialize_gbc(clf)
decoded = deserialize_gbc(encoded)
assert (decoded.predict(X) == clf.predict(X)).all()

This works for scikit-learn v0.19, but don't ask me what will come in the next versions to break this code. I'm neither a prophet nor a developer of sklearn.
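
To actually store such a dict as text, the standard `json` module should be enough. One caveat, stated as an assumption rather than a guarantee: depending on your Python and NumPy versions, some values (for example `n_classes_`) may be NumPy scalars that `json` refuses to encode. The sketch below works around that with a `default` hook that calls `tolist()`; the helper names and the `FakeScalar` stand-in are made up so that the snippet runs without NumPy.

```python
import json

def _to_builtin(obj):
    """Fallback for json: convert NumPy scalars or arrays through their tolist() method."""
    if hasattr(obj, 'tolist'):
        return obj.tolist()
    raise TypeError('%r is not JSON serializable' % (obj,))

def dumps_model(encoded):
    """Turn an encoded model dict into a JSON string."""
    return json.dumps(encoded, default=_to_builtin, sort_keys=True)

def loads_model(text):
    """Read an encoded model dict back from a JSON string."""
    return json.loads(text)

class FakeScalar(object):
    """Stand-in for a NumPy scalar, only so this example has no dependencies."""
    def __init__(self, value):
        self.value = value
    def tolist(self):
        return self.value

# Toy dict shaped like a small piece of the serialize_tree() output.
toy = {
    'tree': {'n_features_': FakeScalar(1)},
    'nodes': {'threshold': [0.5, -2.0, -2.0]},
    'params': {'max_depth': None},
}
text = dumps_model(toy)
restored = loads_model(text)
```

The resulting string can be written to a file and fed to `deserialize_gbc` after loading.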

If you want to be fully independent of new versions of sklearn, the safest thing is to write a function that traverses a serialized tree and makes the prediction, instead of re-creating an sklearn tree.
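
A minimal sketch of such a traversal for a single regression tree, in pure Python, could look like this. It assumes the dict layout produced by `serialize_tree` above, that a child index of -1 marks a leaf, and that a sample goes left when its feature value is `<=` the threshold, which is the convention sklearn's trees use as far as I know. The names `predict_one` and `predict` are mine, and the toy tree is written by hand rather than taken from a fitted model.

```python
TREE_LEAF = -1  # assumed leaf marker: a child index of -1

def predict_one(encoded, row):
    """Walk one sample down a serialized tree and return the leaf value."""
    nodes = encoded['nodes']
    node = 0  # start at the root
    while nodes['children_left'][node] != TREE_LEAF:
        feature = nodes['feature'][node]
        if row[feature] <= nodes['threshold'][node]:
            node = nodes['children_left'][node]
        else:
            node = nodes['children_right'][node]
    # 'value' is nested as [node][output][class]; one output, one value here
    return nodes['value'][node][0][0]

def predict(encoded, rows):
    """Predict for a list of samples, each a list of feature values."""
    return [predict_one(encoded, row) for row in rows]

# Hand-written toy tree: the root splits on feature 0 at 0.5, with two leaves.
toy_tree = {
    'nodes': {
        'children_left': [1, -1, -1],
        'children_right': [2, -1, -1],
        'feature': [0, -2, -2],
        'threshold': [0.5, -2.0, -2.0],
        'value': [[[1.5]], [[1.0]], [[2.0]]],
    }
}
predictions = predict(toy_tree, [[0.0], [1.0]])
```

One thing to watch: sklearn converts inputs to 32-bit floats before comparing them with thresholds, so a sample that sits almost exactly on a threshold could end up in a different leaf here. Check the output against `tree.predict` on your own data before relying on it.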
