Save classifier to disk in scikit-learn

Asked 15/5, 2012 at 0:6 Answered 22/2, 2024 at 17:30

Solved python machine-learning scikit-learn classification

257

How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points : %d" % (iris.target != y_pred).sum())

Hoar answered 15/5, 2012 at 0:6 Comment(0)

254

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

import pickle  # Python 3; on Python 2, cPickle was the faster drop-in
# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
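
As a quick check that the round trip preserves the model, here is a minimal sketch assuming Python 3 and the iris example from the question. The temporary directory and the variable names are illustrative choices, not part of the original answer.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb = GaussianNB().fit(iris.data, iris.target)

# Write the fitted classifier to a throwaway directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_dumped_classifier.pkl")
    with open(path, "wb") as fid:
        pickle.dump(gnb, fid)
    with open(path, "rb") as fid:
        gnb_loaded = pickle.load(fid)

# The reloaded classifier should predict exactly what the original does.
same_predictions = np.array_equal(
    gnb.predict(iris.data), gnb_loaded.predict(iris.data)
)
print("Predictions identical after reload:", same_predictions)
```

If the comparison is False, the usual suspects are a different scikit-learn version or a different Python environment at load time.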

Retiary answered 15/5, 2012 at 1:41 Comment(2)

Works like a charm! I was trying to use np.savez and load it back all along and that never helped. Thanks a lot. – Aigrette 29/1, 2014 at 9:29

in python3, use the pickle module, which works exactly like this. – Jovi 25/11, 2018 at 7:22

246

You can also use joblib.dump and joblib.load, which are much more efficient at handling numerical arrays than the default Python pickler.

Joblib is installed alongside scikit-learn as one of its dependencies:

>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier

>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # accuracy on the training set
0.9526989426822482

>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)

>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482

Edit: in Python 3.8+ it's now possible to use pickle for efficient pickling of objects with large numerical arrays as attributes if you use pickle protocol 5 (which is not the default).
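
A minimal sketch of requesting protocol 5 explicitly, assuming Python 3.8 or later. The SimpleNamespace object is a hypothetical stand-in for a fitted estimator with a large array attribute; with a real estimator the pickle.dump call would look the same.

```python
import io
import pickle
from types import SimpleNamespace

import numpy as np

# Hypothetical stand-in for an estimator holding a large numerical array.
model = SimpleNamespace(coef_=np.arange(1_000_000, dtype=np.float64))

# Ask for protocol 5 explicitly instead of relying on the interpreter default.
buffer = io.BytesIO()
pickle.dump(model, buffer, protocol=5)

buffer.seek(0)
restored = pickle.load(buffer)
arrays_match = np.array_equal(model.coef_, restored.coef_)
print("Arrays identical after reload:", arrays_match)
```

Files written with protocol 5 cannot be read by Python versions older than 3.8, so this only makes sense when every consumer of the file is recent enough.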

Laccolith answered 23/6, 2012 at 13:16 Comment(7)

But from my understanding pipelining works if it's part of a single workflow. If I want to build the model, store it on disk and stop the execution there, then come back a week later and try to load the model from disk, it throws me an error : – Rosewood 3/1, 2014 at 20:24

There is no way to stop and resume the execution of the fit method if this is what you are looking for. That being said, joblib.load should not raise an exception after a successful joblib.dump if you call it from a Python with the same version of the scikit-learn library. – Laccolith 6/1, 2014 at 9:52

If you are using IPython, do not use the --pylab command line flag or the %pylab magic as the implicit namespace overloading is known to break the pickling process. Use explicit imports and the %matplotlib inline magic instead. – Laccolith 6/1, 2014 at 9:53

see the scikit-learn documentation for reference: scikit-learn.org/stable/tutorial/basic/… – Layne 6/6, 2014 at 18:47

Is it possible to retrain previously saved model? Specifically SVC models? – Coper 12/4, 2017 at 16:25

from sklearn.externals import joblib worked and import joblib did not. Very strange. – Polack 10/4, 2019 at 12:2

sklearn.externals.joblib was deprecated in scikit-learn 0.21 and will be removed in 0.23: scikit-learn.org/0.21/whats_new.html#miscellaneous – Battiste 22/4, 2020 at 15:7

136

What you are looking for is called model persistence in scikit-learn terminology, and it is documented in the introduction and in the model persistence sections of the docs.

So you have initialized your classifier and trained it for a long time with

clf = some.classifier()
clf.fit(X, y)

After this you have two options:

1) Using Pickle

import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)

2) Using Joblib

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# now you can save it to a file
joblib.dump(clf, 'filename.pkl')
# and later you can load it
clf = joblib.load('filename.pkl')

Again, it is worth reading the documentation sections mentioned above.

Fahrenheit answered 24/8, 2015 at 4:17 Comment(2)

The Joblib process above works for me ('clf' is the model name used in the file). I call joblib.dump() in one file and load the model in another file with joblib.load() to save prediction time. – Merissa 9/12, 2020 at 10:56

@jtlz2 Try import joblib – Bayly 17/5, 2021 at 23:0

In many cases, particularly with text classification, it is not enough to store just the classifier; you also need to store the vectorizer so that you can vectorize new input in the future.

import pickle
with open('model.pkl', 'wb') as fout:
    pickle.dump((vectorizer, clf), fout)

Later, to use the saved model:

with open('model.pkl', 'rb') as fin:
    vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)
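
For a version that runs end to end, here is a small sketch of the same idea. The toy texts, the labels, and the choice of CountVectorizer with MultinomialNB are assumptions made for illustration, and an in-memory buffer stands in for model.pkl.

```python
import io
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, made up for illustration.
train_texts = [
    "cheap pills buy now",
    "meeting at noon",
    "buy cheap now",
    "lunch meeting today",
]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Pickle both objects together; the buffer plays the role of the file.
buffer = io.BytesIO()
pickle.dump((vectorizer, clf), buffer)

# Later: restore both and classify unseen text with the same vocabulary.
buffer.seek(0)
vectorizer_loaded, clf_loaded = pickle.load(buffer)

new_samples = ["buy pills now", "meeting today"]
X_new = vectorizer_loaded.transform(new_samples)
X_new_preds = clf_loaded.predict(X_new)
print(list(X_new_preds))
```

Saving the pair as one tuple keeps the vocabulary and the classifier weights in sync, which is the point of storing them together.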

Before dumping the vectorizer, you can clear its stop_words_ attribute with:

vectorizer.stop_words_ = None

to make dumping more efficient. Also, if your classifier's parameters are sparse (as in most text classification examples), you can convert them from a dense to a sparse representation, which makes a huge difference in memory consumption, loading and dumping. Sparsify the model with:

clf.sparsify()

This works out of the box for SGDClassifier. For another model that you know is sparse (lots of zeros in clf.coef_), you can manually convert clf.coef_ into a SciPy CSR sparse matrix with:

import scipy.sparse
clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

and then you can store it more efficiently.
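
To see what sparsify does, here is a rough sketch on the digits dataset. The L1 penalty, the alpha value, and the random_state are assumptions chosen so that some coefficients end up at zero; how much space you actually save depends on how many zeros your own model has, so the sizes are printed rather than promised.

```python
import pickle

import scipy.sparse
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

digits = load_digits()

# An L1 penalty tends to push many coefficients to exactly zero.
clf = SGDClassifier(penalty="l1", alpha=0.01, random_state=0)
clf.fit(digits.data, digits.target)

zero_fraction = (clf.coef_ == 0).mean()
dense_size = len(pickle.dumps(clf))

clf.sparsify()  # converts clf.coef_ to a scipy.sparse matrix in place
coef_is_sparse = scipy.sparse.issparse(clf.coef_)
sparse_size = len(pickle.dumps(clf))

print("fraction of zero coefficients:", zero_fraction)
print("pickle size before/after sparsify:", dense_size, sparse_size)

# The sparsified model can still be used for prediction.
predictions = clf.predict(digits.data)
```

Calling clf.densify() converts coef_ back to a dense array if you need to keep training the model afterwards.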

Corina answered 23/11, 2016 at 3:24 Comment(1)

Insightful answer! Just wanted to add in case of SVC, it returns a sparse model parameter. – Viscacha 1/7, 2019 at 14:48

sklearn estimators implement methods that make it easy to save the relevant trained properties of an estimator. Some estimators implement __getstate__ themselves, but others, like the GMM, just use the base implementation, which simply saves the object's inner dictionary:

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state
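
You can inspect that state directly. A small sketch, assuming a scikit-learn release whose BaseEstimator behaves like the snippet above; GaussianNB is just a convenient estimator to demonstrate with.

```python
import sklearn
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
clf = GaussianNB().fit(iris.data, iris.target)

# __getstate__ returns the dictionary that pickle will serialize.
state = clf.__getstate__()
print(sorted(state))

# For estimators defined inside sklearn, the library version is stored too.
version_recorded = state.get("_sklearn_version") == sklearn.__version__
print("version recorded:", version_recorded)
```

The stored _sklearn_version is what lets scikit-learn warn you when a pickle is loaded under a different library version.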

The recommended method to save your model to disk is to use the pickle module:

import pickle

from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X, y)
with open('mymodel', 'wb') as f:
    pickle.dump(model, f)

However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation:

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:

The training data, e.g. a reference to a immutable snapshot

The python source code used to generate the model

The versions of scikit-learn and its dependencies

The cross validation score obtained on the training data

This is especially true for ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.
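
One way to keep that metadata next to the pickle is a small JSON file. This is only a sketch: the file names, the metadata keys, and the training_data and training_script values are hypothetical placeholders, not anything scikit-learn prescribes.

```python
import json
import os
import pickle
import platform
import tempfile

import numpy
import sklearn
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]

cv_scores = cross_val_score(SVC(), X, y, cv=5)
model = SVC().fit(X, y)

metadata = {
    "python_version": platform.python_version(),
    "sklearn_version": sklearn.__version__,
    "numpy_version": numpy.__version__,
    "cv_accuracy_mean": float(cv_scores.mean()),
    # Hypothetical pointers; replace with your own snapshot and script.
    "training_data": "sklearn.datasets.load_iris, first 100 rows, first 2 columns",
    "training_script": "train_model.py",
}

# Store the model and its metadata side by side.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "mymodel.pkl"), "wb") as f:
    pickle.dump(model, f)
with open(os.path.join(out_dir, "mymodel.meta.json"), "w") as f:
    json.dump(metadata, f, indent=2)
```

With the versions and a pointer to the data recorded, you can rebuild the model from scratch if a future sklearn release refuses to load the old pickle.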

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

In the specific case of the scikit, it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:
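
That limitation describes older joblib releases. As far as I know, recent releases also accept an open file object, so an in-memory buffer can stand in for a string. A sketch under that assumption; the payload dict is a made-up stand-in for a fitted estimator.

```python
import io

import joblib
import numpy as np

# Stand-in for a fitted estimator: any picklable object holding arrays.
payload = {"coef_": np.linspace(0.0, 1.0, 1000), "classes_": [0, 1, 2]}

buffer = io.BytesIO()
joblib.dump(payload, buffer)    # write to the in-memory buffer
serialized = buffer.getvalue()  # raw bytes, e.g. for a database or a socket

restored = joblib.load(io.BytesIO(serialized))
round_trip_ok = np.array_equal(payload["coef_"], restored["coef_"])
print("round trip ok:", round_trip_ok, "bytes:", len(serialized))
```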

Worl answered 7/11, 2017 at 11:53 Comment(2)

"but can only pickle to the disk and not to a string": but you could pickle this into StringIO from joblib. This is what I do all the time. – Anglicist 16/4, 2019 at 3:5

My current project is doing something similar. Do you know what "The training data, e.g. a reference to a immutable snapshot" means here? TIA! – Shulamith 9/7, 2020 at 15:25

sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:

/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)

Therefore, you need to install joblib:

pip install joblib

and finally write the model to disk:

import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier


digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)

with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

Now in order to read the dumped file all you need to run is:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)

Ethelethelbert answered 21/2, 2020 at 16:51 Comment(0)

As of Feb 2024, other options are also available (as per the docs: https://scikit-learn.org/stable/model_persistence.html ):

skops: https://skops.readthedocs.io/en/latest/persistence.html
sklearn2pmml: https://github.com/jpmml/sklearn2pmml
sklearn-onnx: https://onnx.ai/sklearn-onnx/

Chara answered 22/2, 2024 at 17:30 Comment(0)
