For algorithms that support partial_fit, it would be possible to write an outer loop in a script to do out-of-core, large-scale text classification. However, some elements are missing: a dataset reader that iterates over the data on disk (stored as folders of flat files, in a SQL database server, in a NoSQL store, or in a Solr index with stored fields, for instance), and an online text vectorizer.
Here is a sample integration template to explain how it would fit together.
    import joblib
    from sklearn.linear_model import Perceptron

    # Placeholders for application-specific code (hypothetical module)
    from mymodule import SomeTextDocumentVectorizer
    from mymodule import DataSetReader

    dataset_reader = DataSetReader('/path/to/raw/data')

    # partial_fit needs to know the possible classes ahead of time
    expected_classes = dataset_reader.get_all_classes()

    feature_extractor = SomeTextDocumentVectorizer()
    classifier = Perceptron()

    for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):
        vectors = feature_extractor.transform(documents)
        classifier.partial_fit(vectors, labels, classes=expected_classes)

        if i % 100 == 0:
            # dump model to be able to monitor quality and later analyse
            # convergence externally
            joblib.dump(classifier, 'model_%04d.pkl' % i)
The dataset reader class is application specific and will probably never make it into scikit-learn (except maybe for a reader for a folder of flat text files or CSV files, which would not require adding a new dependency to the library).
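For the simple flat-file case, a reader could look roughly like the sketch below. It assumes one sub-folder per class and one UTF-8 text file per document; the class name FlatFileDataSetReader and the chunk_size parameter are made up for illustration, and only the get_all_classes and iter_chunks method names come from the template above.

```python
import os


class FlatFileDataSetReader:
    """Hypothetical reader: one sub-folder per class, one text file per document."""

    def __init__(self, root, chunk_size=1000):
        self.root = root
        self.chunk_size = chunk_size

    def get_all_classes(self):
        # Class names are the sub-folder names, sorted for a stable order.
        return sorted(
            name for name in os.listdir(self.root)
            if os.path.isdir(os.path.join(self.root, name))
        )

    def iter_chunks(self):
        # Yield (documents, labels) lists of at most chunk_size items,
        # so that only one chunk is held in memory at a time.
        documents, labels = [], []
        for label in self.get_all_classes():
            folder = os.path.join(self.root, label)
            for filename in sorted(os.listdir(folder)):
                path = os.path.join(folder, filename)
                with open(path, encoding="utf-8") as f:
                    documents.append(f.read())
                labels.append(label)
                if len(documents) == self.chunk_size:
                    yield documents, labels
                    documents, labels = [], []
        if documents:
            # Last, possibly smaller, chunk.
            yield documents, labels
```

Note that this walks the folders class by class, so most chunks contain a single class. Online learners generally behave better when the classes are mixed, so in practice you would probably want to shuffle the file list (or interleave the folders) before chunking.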
The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a Python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.
Alternatively, we could build a HashingTextVectorizer that would use the hashing trick to drop the dictionary requirement. Neither of those exists at the moment (although we already have some building blocks, such as a murmurhash wrapper and a pull request for hashing features).
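To make the hashing trick concrete, here is a minimal standard-library sketch. It is not the scikit-learn implementation: the function name hash_vectorize is made up, the tokenization is a naive regular expression, and zlib.crc32 stands in for murmurhash simply because it ships with Python.

```python
import re
import zlib


def hash_vectorize(document, n_features=2 ** 20):
    """Map a document to a sparse {column index: count} dict, with no vocabulary."""
    counts = {}
    for token in re.findall(r"\w+", document.lower()):
        # crc32 is deterministic across runs, unlike the built-in hash() on str.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] = counts.get(index, 0) + 1
    return counts
```

Since the column index is computed directly from the token, there is nothing to fit and nothing to keep in memory between chunks. The price is that distinct tokens can collide on the same column, and that you cannot map a column back to a token.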
In the meantime, I would advise you to have a look at Vowpal Wabbit and maybe the Python bindings available for it.
Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.
Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizer that can directly deal with text data.
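With HashingVectorizer available, the outer loop can be sketched end to end as below. The chunks list is made-up toy data standing in for a real on-disk reader, and n_features is an arbitrary choice; treat it as an illustration rather than a tuned setup.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

# Toy in-memory chunks standing in for a real on-disk reader (made-up data).
chunks = [
    (["cheap pills buy now", "meeting at noon tomorrow"], ["spam", "ham"]),
    (["buy cheap watches now", "lunch meeting moved to noon"], ["spam", "ham"]),
]
all_classes = ["ham", "spam"]  # must be known before the first partial_fit call

vectorizer = HashingVectorizer(n_features=2 ** 18)  # stateless: no fit needed
classifier = Perceptron()

for documents, labels in chunks:
    # transform works chunk by chunk because no vocabulary has to be learned
    vectors = vectorizer.transform(documents)
    classifier.partial_fit(vectors, labels, classes=all_classes)

prediction = classifier.predict(vectorizer.transform(["buy cheap pills"]))
```

The important point is that the vectorizer is never fitted, so every chunk is projected into the same fixed-width feature space without a pass over the whole corpus.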
Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.