Multi-Label Document Classification
I have a database in which I store data with three fields: id, text, {labels}. Note that each text has been assigned more than one label/tag/class. I want to build a model (in Weka, RapidMiner, or Mahout) that can recommend/assign a set of labels/tags/classes to a given text.

I have heard about SVM and the Naive Bayes classifier, but I am not sure whether they support multi-label classification. Anything that points me in the right direction is more than welcome!

Afterpiece answered 21/5, 2013 at 15:6 Comment(0)
The basic multilabel classification method is one-vs.-the-rest (OvR), also called binary relevance (BR). The basic idea is that you take an off-the-shelf binary classifier, such as Naive Bayes or an SVM, then create K instances of it to solve K independent classification problems. In Python-like pseudocode:

for each class k:
    learner = SVM(settings)                         # for example
    targets = [k in labels_of(x) for x in samples]  # binary: does x carry label k?
    learners[k] = learner.learn(samples, targets)

Then at prediction time, you just run each of the binary classifiers on a sample and collect the labels for which they predict positive.

(Both training and prediction can obviously be done in parallel, since the problems are assumed to be independent. See Wikipedia for links to two Java packages that do multi-label classification.)
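As a concrete illustration of the binary-relevance baseline, here is a minimal runnable sketch with scikit-learn's `OneVsRestClassifier` (the toy texts and label names are invented for the example):

```python
# Binary relevance / one-vs-rest for multi-label text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus: each text carries one or more labels.
texts = [
    "the stock market fell sharply today",
    "the home team won the championship game",
    "shares of the sports retailer rallied",
    "parliament passed the new budget bill",
]
labels = [{"finance"}, {"sports"}, {"finance", "sports"}, {"politics"}]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

binarizer = MultiLabelBinarizer()     # {labels} -> 0/1 indicator matrix
Y = binarizer.fit_transform(labels)   # shape: (n_samples, n_classes)

# One independent binary SVM per label, trained internally.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)

# Collect the labels whose binary classifier predicts positive.
predicted = binarizer.inverse_transform(clf.predict(X))
```

`MultiLabelBinarizer` converts the label sets into the indicator matrix that `OneVsRestClassifier` expects, and `inverse_transform` maps the 0/1 predictions back to label names.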

Dorsy answered 21/5, 2013 at 15:30 Comment(2)
There's plenty of scope to go beyond a series of independent problems too. For example, with a probabilistic classifier (logistic regression, let's say), you can define a distribution over the resulting label set, e.g. a topic model or MRF, and optimise globally. I'm sure you could incorporate a similar idea into an SVM too using Platt's method, or some direct discriminative global criterion. – Outline
@BenAllison: sure, but I'm just pointing out the baseline approach and a bunch of toolkits that do more advanced stuff. – Dorsy
SVM is a binary classifier by nature, but there are many alternatives that allow it to be applied in multi-label settings, essentially by combining multiple binary instances of SVM.

Some examples are in the multi-class section of the SVM Wikipedia article. I am not sure if you are interested in the details, but they are included in Weka and RapidMiner. For example, the SMO classifier is one of the variations for applying SVM to multilabel problems.

Naive Bayes can be directly applied to multi-label environments.
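A hedged sketch of that idea with scikit-learn: train one `MultinomialNB` per label via a one-vs-rest wrapper, take the per-label probabilities P(label|doc), and assign every label whose probability clears a threshold (the corpus and the 0.5 threshold below are made up; in practice the threshold is tuned on a validation set):

```python
# Multi-label Naive Bayes via per-label probabilities plus a threshold.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "goal scored in the final minute",
    "central bank raises interest rates",
    "club signs sponsorship deal with bank",
]
labels = [{"sports"}, {"finance"}, {"sports", "finance"}]

X = CountVectorizer().fit_transform(texts)
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X, Y)

proba = clf.predict_proba(X)   # P(label | doc) for every label
assigned = proba >= 0.5        # boolean matrix: label assigned or not
```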

Slalom answered 21/5, 2013 at 15:20 Comment(10)
I think you're confusing multiclass and multilabel classification. In the former, each sample has one class but there are more than two possibilities; in the latter, each sample can belong to multiple classes simultaneously. – Dorsy
OK, so let's choose NB. What should the attributes be in that scenario? I have some ideas on that, but I would like to hear suggestions from someone with more experience than me. – Afterpiece
@larsmans Exactly. What I am asking about is multi-label classification, i.e. each sample can belong to multiple classes simultaneously. – Afterpiece
I wasn't confusing the two concepts, although I have to admit that my explanation was not clear at all. By the way, with some classifiers such as Naive Bayes there is no need to split the problem into multiple binary classifiers. For a given test document, the probability of each class given the document, P(class|doc), is computed. This information can be used for multi-label classification if thresholding strategies are applied. – Slalom
But how do you learn the thresholds? That would require fitting another model on top of NB. – Dorsy
@user2295350: for documents, tf-idf-weighted term frequencies are the baseline approach. – Dorsy
@larsmans tf-idf sounds OK to me. So, if I understand correctly, I have to compute the tf-idf for each of the words in my corpus/bag-of-words, always with respect to the input document. Then, I sort the results in descending order. Finally, the word ranked first is considered the most likely to describe the input document, the word in second place is the second most related, and so on. Is that right? – Afterpiece
@larsmans A common approach for NB, kNN and, in some cases (when you output scores instead of a {-1, 1} decision), SVM is to obtain a score for each document–class pair. With this approach, you do not need to build N binary classifiers for NB and kNN. In all cases, the classifier learns from a training set, and the thresholds are optimised via cross-validation and/or a validation set on which a quality metric (e.g. F1) is optimised. Once this is done, the classifier produces a score per class for each test document, and if that score is above the threshold, the class is assigned. – Slalom
Some references (I couldn't paste them before for lack of space): [Lewis_2004], about one of the best-known text classification collections (RCV1), explains how to use thresholding for SVM in a multi-label setting. [Yang_2001] (bradblock.com.s3-website-us-west-1.amazonaws.com/…) is a study of thresholding strategies in text classification. – Slalom
@Afterpiece The documents are represented using a bag-of-words representation that is (usually) based on TF-IDF. A TF-IDF score for a word and a document measures how important or relevant the word is to the document, based on how common the term is in the document (TF) and how rare it is in the collection (IDF). A document is represented as the set of TF-IDF scores for all terms (or a subset of them if you do feature selection). I strongly recommend having a look at this [Text Classification Survey] to understand the process better (nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf) – Slalom
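The TF×IDF weighting described in that comment can be computed by hand; a minimal sketch using the plain `log(N/df)` formulation (the toy documents are invented, and real vectorizers typically use smoothed variants):

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [["cat", "sat", "mat"],
        ["cat", "cat", "dog"],
        ["dog", "ran", "far"]]
N = len(docs)

def tf(term, doc):
    # Term frequency: how common the term is within this document.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: how rare the term is in the collection.
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)
```

In the third document, "ran" (which appears nowhere else) gets a higher TF-IDF score than "dog" (which also appears in the second document), illustrating how IDF boosts rare, discriminative terms.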
I can suggest some tools that are extensions to Weka and do multi-label classification:

  1. MEKA: A Multi-label Extension to WEKA
  2. Mulan: A Java library for multi-label learning

There is also an SVM library extension, SVMLib. If you are happy with Python packages, scikit-learn also provides support for multi-label classification.

Also, this recent ICML 2013 paper, "Efficient Multi-label Classification with Many Labels", should help if you want to implement one on your own.

Zacarias answered 16/9, 2013 at 8:59 Comment(0)