I have been racking my brain over this problem for the past 2-3 weeks. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.
I have around 4.5 million text documents as training data and around 1 million as test data, and there are around 35K labels.
I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not scalable enough given the number of documents that I have.
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))
scikit-learn provides OneVsRestClassifier, into which I can feed any estimator. For multi-label I found only LinearSVC and SGDClassifier to work correctly. According to my benchmarks, SGD outperforms LinearSVC in both memory and time. So I have something like this:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)
But this suffers from some serious issues:
- OneVsRestClassifier does not have a partial_fit method, which makes out-of-core learning impossible with it. Are there any alternatives for that? (See the sketch after this list for one possible workaround.)
- HashingVectorizer/TfidfVectorizer both run on a single core and have no n_jobs parameter, so hashing the documents is taking too much time. Any alternatives/suggestions? Also, is my value of n_features sensible?
- I tested on 1 million documents. The hashing takes 15 minutes, and when it comes to clf.fit(X, y) I receive a MemoryError, because OvR internally uses LabelBinarizer and tries to allocate a dense matrix of shape (n_samples x n_classes), which is practically impossible to allocate. What should I do?
- Are there any other libraries out there with reliable and scalable multi-label algorithms? I know of Gensim and Mahout, but neither seems to have anything for multi-label problems.
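For the partial_fit and MemoryError points, here is a minimal sketch (my own assumption of a workaround, not something OneVsRestClassifier gives you) of doing one-vs-rest manually: one binary SGDClassifier per label trained with partial_fit on hashed mini-batches, so neither the full document set nor a dense (n_samples x n_classes) label matrix ever has to sit in memory. iter_batches() is a hypothetical generator yielding (texts, sets_of_label_ids) chunks.

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    n_labels = 35000  # total number of distinct labels
    vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                             stop_words='english', n_features=(2 ** 10))

    # One binary classifier per label (same SGD settings as above, minus n_jobs).
    clfs = [SGDClassifier(loss='log', penalty='l2') for _ in range(n_labels)]
    binary_classes = np.array([0, 1])

    for texts, label_sets in iter_batches():      # hypothetical mini-batch source
        X = vect.transform(texts)                 # stateless, so no fit() needed
        for j, clf in enumerate(clfs):
            # Binary target for label j, built for this batch only.
            y_j = np.array([1 if j in labels else 0 for labels in label_sets])
            clf.partial_fit(X, y_j, classes=binary_classes)

Looping over all 35K labels for every batch is obviously slow, but memory stays bounded, and the per-label updates are independent of each other, so they could be parallelized.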
HashingVectorizer is perfectly scalable: if you throw twice as many computational resources at it, you will process the data twice as fast (you can partition the data and run the processing in parallel thanks to its statelessness and bounded memory usage). This is the exact definition of scalability. I agree that HashingVectorizer could probably be optimized further to run faster on the same computational resources, but that has nothing to do with the scalability problem. – Gerlac
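As a rough illustration of the partitioning the comment describes (a sketch under my own assumptions, not code from the thread): because HashingVectorizer.transform is stateless, disjoint chunks of documents can be hashed in separate worker processes with joblib and the resulting sparse matrices stacked afterwards.

    from joblib import Parallel, delayed
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import HashingVectorizer

    def hash_chunk(vect, chunk):
        # transform() never fits anything, so every worker is independent
        return vect.transform(chunk)

    def hash_in_parallel(docs, n_jobs=4, chunk_size=10000):
        vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                                 stop_words='english', n_features=(2 ** 10))
        chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
        parts = Parallel(n_jobs=n_jobs)(delayed(hash_chunk)(vect, c) for c in chunks)
        return vstack(parts)

This only changes wall-clock time, not the output: each worker produces a sparse chunk of at most chunk_size x n_features, and vstack reassembles them in order.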