I have been racking my brain over this problem for the past 2-3 weeks. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.
I have around 4.5 million text documents as training data and around 1 million as test data, and there are around 35K labels.
I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not scalable enough given the number of documents that I have.
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))
scikit-learn provides OneVsRestClassifier, into which I can feed any estimator. For multi-label I found only LinearSVC and SGDClassifier to work correctly. According to my benchmarks, SGD outperforms LinearSVC in both memory and time. So I have something like this:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)
But this suffers from some serious issues:
- OneVsRestClassifier does not have a partial_fit method, which makes out-of-core learning impossible with it. Are there any alternatives for that? (See the sketch after this list for one possible workaround.)
- HashingVectorizer/TfidfVectorizer both run on a single core and have no n_jobs parameter, so hashing the documents is taking too much time. Any alternatives/suggestions? Also, is my value of n_features sensible?
- I tested on 1 million documents. The hashing takes 15 minutes, and when it comes to clf.fit(X, y) I receive a MemoryError, because OvR internally uses LabelBinarizer and tries to allocate a dense matrix of shape (n_samples x n_classes), which is practically impossible to allocate. What should I do?
- Are there any other libraries out there with reliable and scalable multi-label algorithms? I know of Gensim and Mahout, but neither seems to have anything for multi-label problems.
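For the partial_fit and MemoryError points, here is a minimal sketch (my own assumption of a workaround, not something OneVsRestClassifier gives you) of doing one-vs-rest manually: one binary SGDClassifier per label trained with partial_fit on hashed mini-batches, so neither the full document set nor a dense (n_samples x n_classes) label matrix ever has to sit in memory. iter_batches() is a hypothetical generator yielding (texts, sets_of_label_ids) chunks.

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    n_labels = 35000  # total number of distinct labels
    vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                             stop_words='english', n_features=(2 ** 10))

    # One binary classifier per label (same SGD settings as above, minus n_jobs).
    clfs = [SGDClassifier(loss='log', penalty='l2') for _ in range(n_labels)]
    binary_classes = np.array([0, 1])

    for texts, label_sets in iter_batches():      # hypothetical mini-batch source
        X = vect.transform(texts)                 # stateless, so no fit() needed
        for j, clf in enumerate(clfs):
            # Binary target for label j, built for this batch only.
            y_j = np.array([1 if j in labels else 0 for labels in label_sets])
            clf.partial_fit(X, y_j, classes=binary_classes)

Looping over all 35K labels for every batch is obviously slow, but memory stays bounded, and the per-label updates are independent of each other, so they could be parallelized.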
HashingVectorizer is perfectly scalable: if you throw twice as many computational resources at it, you will process the data twice as fast (you can partition the data and run the processing in parallel thanks to its statelessness and bounded memory usage). This is the exact definition of scalability. I agree that HashingVectorizer could probably be optimized further to run faster on the same computational resources, but that has nothing to do with the scalability problem. – Gerlac
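As a rough illustration of the partitioning the comment describes (a sketch under my own assumptions, not code from the thread): because HashingVectorizer.transform is stateless, disjoint chunks of documents can be hashed in separate worker processes with joblib and the resulting sparse matrices stacked afterwards.

    from joblib import Parallel, delayed
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import HashingVectorizer

    def hash_chunk(vect, chunk):
        # transform() never fits anything, so every worker is independent
        return vect.transform(chunk)

    def hash_in_parallel(docs, n_jobs=4, chunk_size=10000):
        vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                                 stop_words='english', n_features=(2 ** 10))
        chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
        parts = Parallel(n_jobs=n_jobs)(delayed(hash_chunk)(vect, c) for c in chunks)
        return vstack(parts)

This only changes wall-clock time, not the output: each worker produces a sparse chunk of at most chunk_size x n_features, and vstack reassembles them in order.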