Dealing with class imbalance in multi-label classification

I've seen a few questions on class imbalance in a multiclass setting. However, I have a multi-label problem, so how would you deal with it in this case?

I have a set of around 300k text examples. As mentioned in the title, each example has at least one label, and there are only 100 possible unique labels. I've reduced this problem down to binary classification for Vowpal Wabbit by taking advantage of namespaces, e.g.

From:

healthy fruit | bananas oranges jack fruit
evil monkey | bipedal organism family guy
...  

To:

1 |healthy bananas oranges jack fruit
1 |fruit bananas oranges jack fruit
0 |evil bananas oranges jack fruit
0 |monkey bananas oranges jack fruit
0 |healthy bipedal organism family guy
0 |fruit bipedal organism family guy
1 |evil bipedal organism family guy
1 |monkey bipedal organism family guy
...  

I'm using the default options provided by VW (which I believe is online SGD with the squared loss function). I'm using squared loss because it closely resembles the Hamming loss: for 0/1 predictions, the per-label squared error equals the 0/1 error, so averaging it over all example-label pairs gives the Hamming loss.

After training, when testing on the same training set, I've noticed that all examples were predicted with the '0' label... which is one way of minimizing loss, I guess. At this point, I'm not sure what to do. I was thinking of using cost-sensitive one-against-all classification to try to balance the classes, but reducing multi-label to multi-class is infeasible since there exist 2^100 possible label combinations. I'm wondering if anyone else has any suggestions.

Edit: I finally had the chance to test out class imbalance, specifically for VW. VW handles imbalance very badly, at least for high-dimensional, sparsely populated text features. I've tried ratios from 1:1 to 1:25, with performance degrading abruptly at the 1:2 ratio.

Dorkas answered 9/12, 2013 at 0:55 Comment(3)
I can get rid of the 0 labels entirely. And the labels are namespaces in the binary reduction.Dorkas
Were you able to find an answer to your question? It doesn't look like we have a solid answer yet.Tao
@ML_Pro See my answer: use --loss_function logistic.Allman

Any linear model will handle class imbalance "very badly" if you force it to use squared loss for a binary classification problem. Think about the loss function: if 99% of observations are zero, predicting 0 for every example gives a mean squared error of only 0.01. Vowpal Wabbit can't do magic: if you ask it to minimize squared error loss, it will indeed minimize squared error loss, as will any other regression program.

Here's a demonstration of the same "problem" with a linear regression model in R:

set.seed(42)
rows <- 10000
cols <- 100
x <- matrix(sample(0:1, rows*cols, replace=TRUE), nrow=rows)  # binary feature matrix
y <- x %*% runif(cols) + runif(rows)                          # continuous latent response
y <- ifelse(y<quantile(y, 0.99), 0, 1)                        # binarize: top 1% become class 1 (99:1 imbalance)
lin_mod <- glm(y~., data.frame(y, x), family='gaussian')      # linear model (squared loss)
log_mod <- glm(factor(y)~., data.frame(y, x), family='binomial')  # logistic model (log loss)

Comparing predictions from a linear vs logistic model shows that the linear model always predicts 0 and the logistic model predicts the correct mix of 0's and 1's:

> table(ifelse(predict(lin_mod, type='response')>0.50, 1, 0))

    0 
10000 
> table(ifelse(predict(log_mod, type='response')>0.50, 1, 0))

   0    1 
9900  100 

Use --loss_function="logistic" or --loss_function="hinge" for binary classification problems in Vowpal Wabbit (note that both of these expect labels of -1 and 1 rather than 0 and 1). You can evaluate your predictions after the fact using Hamming loss, but it may be informative to compare your results to the Hamming loss of always predicting 0.
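As a minimal sketch of that comparison in base R (hypothetical names: pred is your 0/1 predictions over all example-label pairs, truth is the matching 0/1 ground truth):

hamming <- function(pred, truth) mean(pred != truth)  # fraction of example-label slots predicted wrong
hamming(pred, truth)        # your model's Hamming loss
hamming(0 * truth, truth)   # baseline: always predict 0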

Allman answered 31/3, 2014 at 16:54 Comment(3)
Any particular reason why a linear model is worse than a logistic one at imbalanced classification? Or why minimizing squared loss is worse than minimizing cross-entropy (maximizing log-likelihood)? Frankly, the only reason I can think of for why most models do a poor job on imbalanced classification is that they try to minimize the total loss over the training data: if we get most of the majority-class examples wrong, the loss is high, whereas if we get most of the minority-class examples wrong, the extra loss incurred is negligible.Hankypanky
@Hankypanky It depends on what you want out of the model. Note the quote from the original question "I've noticed that all examples were predicted with the '0' label... which is one way of minimizing loss, I guess". My point was simply that minimizing rmse will tend to give this result. If you don't want this, you need to use another loss function.Allman
@Hankypanky hah, no problem. You can +1 my comment if you like it :-DAllman

In general, if you're looking to account for class imbalance in your training data, it means you have to change to a better-suited loss function or evaluation metric. For class imbalance specifically, you want to switch to the area under the ROC curve (AUC), which is designed to account for exactly this issue.

There's a multi-label version, but since you've already reduced the problem to binary classification it should just work out of the box.

Here's a Wikipedia article explaining the concept more fully.

And here's the relevant sklearn documentation, which might be less helpful since I'm not sure what language you're working in.
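As a rough sketch, AUC for one binary problem can be computed in base R from ranks alone (hypothetical names: scores are the real-valued predictions, y the 0/1 ground truth):

auc <- function(scores, y) {
  r <- rank(scores)                       # ranks of the predicted scores
  n1 <- sum(y == 1); n0 <- sum(y == 0)    # counts of positives and negatives
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)  # P(random positive outranks random negative)
}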

Ashe answered 9/12, 2013 at 1:11 Comment(2)
AUC is not designed "specifically" for imbalanced datasets. It is about postponing the decision about the precision/recall tradeoff (until some domain expert tells you the relative cost of false positives vs. false negatives). If you know the required levels of precision/recall, you don't need AUC for model selection. Having an imbalanced dataset just requires monitoring two quantities instead of one: precision/recall, sensitivity/specificity, etc. Summarising to one quantity like AUC or the F-score can easily mislead you. The problem in question is totally different.Artist
@Artist In fact it is not. I may have oversimplified a bit, but AUC as a metric is specifically chosen to root out issues of random guessing and class imbalance, where simple accuracy fails. When you train a model with a serious imbalance and optimize for accuracy, the model quickly converges on predicting only a single class, as happened in the question. If instead you use AUC as the evaluation metric rather than accuracy, this problem disappears. If you are unconvinced, think about what happens when you randomly guess, or always guess the same class.Ashe

I take it you have reduced the problem to 100 binary classification problems? That would be a standard way to do things in the multi-label setting.

If your evaluation metric really is the Hamming loss, then you might actually be better off just predicting the majority class for each binary problem; that is hard to beat for highly imbalanced problems. But in most cases your evaluation metric itself is different. For example, you may want to optimize the F1 measure (micro or macro). In such cases you can try to somehow balance the positive and negative samples for each binary problem. There are a few ways of doing this.

As Slater mentioned, you could try to optimize AUC for each of the learning problems, in which case you will learn a real-valued function taking an instance as input. Then, instead of thresholding at a default value (usually 0), you can threshold at a different value and see how the performance changes.

In fact, you can try this 'different' thresholding even for the ordinary least-squares model you have already optimized. The threshold is crucial, though, and you will have to choose it via cross-validation.
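Here's a minimal sketch of that threshold search in R, assuming hypothetical val_scores (real-valued predictions on a held-out set) and val_y (its 0/1 labels), tuned for F1 as an example:

f1 <- function(pred, y) {
  tp <- sum(pred == 1 & y == 1)   # true positives
  fp <- sum(pred == 1 & y == 0)   # false positives
  fn <- sum(pred == 0 & y == 1)   # false negatives
  2 * tp / (2 * tp + fp + fn)
}
thresholds <- quantile(val_scores, probs = seq(0.05, 0.95, by = 0.05))  # candidate cut points
f1_by_t <- sapply(thresholds, function(t) f1(ifelse(val_scores > t, 1, 0), val_y))
best_t <- thresholds[which.max(f1_by_t)]  # keep the threshold with the best held-out F1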

Alternatively, you could leave the threshold alone and instead change the 'weights' of the examples in the different learning problems. For example, if you find the 'healthy' label occurring in 1k samples and not occurring in 29k samples, just use a weight of 29 for the examples with the 'healthy' label and a weight of 1 for examples without it.

I don't know offhand how you'd do this in VW, though.
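For what it's worth, VW's plain-text input format does accept an optional importance weight between the label and the bar, so a hedged sketch of the weighted 'healthy' problem (reusing the question's examples, with the illustrative 29:1 ratio) would look like:

1 29 |healthy bananas oranges jack fruit
0 1 |healthy bipedal organism family guy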

Nertie answered 16/12, 2013 at 22:0 Comment(0)
