You want to take a look at incremental learning techniques for that. Many scikit-learn estimators support a partial_fit on the data, which means you can incrementally train the model on small batches of data.
A common approach for these cases is to use SGDClassifier (or SGDRegressor), which updates the model's parameters from a small batch of samples on each iteration, making it a natural candidate for online learning problems. However, you must update the model through the method partial_fit; calling fit instead will retrain the whole model from scratch.
From the documentation:
SGD allows minibatch (online/out-of-core) learning, see the partial_fit method
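For instance, here's a rough sketch of out-of-core training on a stream of batches (get_batches below is a hypothetical generator with random data, standing in for your own loader):

import numpy as np
from sklearn.linear_model import SGDClassifier

def get_batches():
    # hypothetical loader yielding (X_batch, y_batch) chunks;
    # replace with reads from disk, a database, a socket, etc.
    rng = np.random.RandomState(0)
    for _ in range(5):
        yield rng.rand(100, 4), rng.randint(0, 3, size=100)

clf = SGDClassifier()
all_classes = np.array([0, 1, 2])  # every label must be known up front
for X_batch, y_batch in get_batches():
    # each call updates the current weights rather than refitting from scratch
    clf.partial_fit(X_batch, y_batch, classes=all_classes)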
Though, as mentioned, there are several other estimators in scikit-learn that implement the partial_fit API, as you can see in the incremental learning section of the docs, including MultinomialNB, linear_model.Perceptron and MiniBatchKMeans, among others.
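The same pattern carries over to the unsupervised ones; here's a minimal sketch with random data for MiniBatchKMeans, which doesn't need the classes argument at all:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3)
for _ in range(5):
    # each batch of samples refines the current centroids
    km.partial_fit(rng.rand(100, 4))
print(km.cluster_centers_.shape)  # (3, 4)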
Here's a toy example to illustrate how it's used:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
X, y = load_iris(return_X_y=True)
clf = SGDClassifier()
kf = KFold(n_splits=2)
kf_splits = kf.split(X)
train_index, test_index = next(kf_splits)
# partial_fit on the first fold of training data. On the
# first call the full set of classes must be provided
clf.partial_fit(X[train_index], y[train_index], classes=np.unique(y))
# incremental update on new data (classes no longer needed)
train_index, test_index = next(kf_splits)
clf.partial_fit(X[train_index], y[train_index])
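You can then evaluate on the held-out fold as usual:

# accuracy on the test fold from the last split
print(clf.score(X[test_index], y[test_index]))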