Retraining an existing machine learning model with new data

I have an ML model trained on a dataset of about one million samples (supervised text classification). However, I want the same model to be trained again as soon as new training data comes in.

This process is continuous, and I don't want to lose the model's predictive power every time it receives a new data set. I also don't want to merge the new data with my historical data (~1 million samples) and train from scratch again.

Ideally, the model would grow gradually, training on all the data over a period of time while preserving its accumulated knowledge each time it receives a new training set. What is the best way to avoid retraining on all the historical data? A code sample would help me.

Ethe answered 29/11, 2018 at 7:03

You want to look into incremental learning techniques for that. Many scikit-learn estimators offer a partial_fit method, which means that you can incrementally train on small batches of data.

A common approach for these cases is to use SGDClassifier (or SGDRegressor), which updates the model's parameters on each iteration using a fraction of the samples, making it a natural candidate for online learning problems. However, you must update the model through the partial_fit method; calling fit instead will retrain the whole model from scratch.

From the documentation:

SGD allows minibatch (online/out-of-core) learning, see the partial_fit method

As mentioned, there are several other estimators in scikit-learn that implement the partial_fit API, listed in the incremental learning section of the user guide, including MultinomialNB, linear_model.Perceptron and MiniBatchKMeans, among others.
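For the text-classification setting in the question, a stateless vectorizer such as HashingVectorizer pairs well with these estimators, since it requires no fitting and can transform each incoming batch independently. Here is a minimal sketch with MultinomialNB (the example texts and labels are made up for illustration):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer is stateless: no fit step, so each new batch
# can be transformed without touching the historical data.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()

# First batch: the full set of classes must be declared on the first call.
texts = ["great product, works perfectly", "terrible service, very slow"]
labels = [1, 0]
clf.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

# Later batch: update the same model in place.
new_texts = ["awful experience", "excellent support"]
new_labels = [0, 1]
clf.partial_fit(vectorizer.transform(new_texts), new_labels)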


Here's a toy example to illustrate how it's used with SGDClassifier:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

clf = SGDClassifier()

kf = KFold(n_splits=2)
kf_splits = kf.split(X)

# Partial fit with the first batch of training data.
# On the first call, all the classes must be provided.
train_index, test_index = next(kf_splits)
clf.partial_fit(X[train_index], y[train_index], classes=np.unique(y))

# Update the same model with new data as it arrives.
train_index, test_index = next(kf_splits)
clf.partial_fit(X[train_index], y[train_index])
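Since the original concern is preserving predictive power, you can also check after each incremental update that performance on held-out data has not degraded, for example:

# Mean accuracy on the held-out fold after the latest update.
print(clf.score(X[test_index], y[test_index]))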
Characharabanc answered 29/11, 2018 at 14:13

What you are looking for is incremental learning. There is an excellent library called creme that helps you with that.

All the tools in the library can be updated with a single observation at a time, and can therefore be used to learn from streaming data.

Here are some benefits of using creme (and online machine learning in general):

- Incremental: models can update themselves in real time.
- Adaptive: models can adapt to concept drift.
- Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
- Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
- Fast: when the goal is to learn and predict with a single instance at a time, creme is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.

Check it out here: https://pypi.org/project/creme/
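As a minimal sketch of the streaming workflow (the feature dicts and stream below are made up for illustration, and the API names follow the creme releases current at the time of this answer):

from creme import linear_model
from creme import metrics

# creme models consume one observation at a time as {feature: value} dicts.
model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# A made-up stream of (features, label) pairs.
stream = [
    ({"length": 14, "exclamations": 0}, 0),
    ({"length": 3, "exclamations": 2}, 1),
]

for x, y in stream:
    y_pred = model.predict_one(x)  # predict before learning (progressive validation)
    metric = metric.update(y, y_pred)
    model = model.fit_one(x, y)    # update the model with this single sample

print(metric)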

Unpleasantness answered 25/3, 2021 at 1:41
