How best to deal with "None of the above" in Image Classification?

Asked 24/4, 2017 at 2:7 Answered 24/4, 2017 at 16:0

This seems to be a fundamental question which some of you out there must have an opinion on. I have an image classifier implemented in CNTK with 48 classes. If the image does not match any of the 48 classes very well, I'd like to be able to conclude that it is not one of these 48 image types. My original idea was simply that if the highest output of the final softmax layer was low, I could conclude that the test image matched none of the classes well. While I occasionally see this happen, in most testing the softmax still produces a very high (and mistaken) output when handed an 'unknown image type'. But maybe my network is overfit, and if it weren't, my original idea would work fine. What do you think? Is there any way to define a 49th class called 'none-of-the-above'?

Hypabyssal answered 24/4, 2017 at 2:7 Comment(1)

Any thoughts on the answers below? Please comment and vote. – Cineaste 27/4, 2017 at 23:55

You do indeed have these two options: thresholding the posterior probabilities (softmax values), and adding a garbage class.

In my area (speech), both approaches have their place:

If "none of the above" inputs are of the same nature as the "above" (e.g. non-grammatical inputs), thresholding works fine. Note that the posterior probability for a class is equal to one minus an estimate of the error rate for choosing this class. Rejecting anything with posterior < 50% would be rejecting all cases where you are more likely wrong than right. As long as your none-of-the-above classes are of similar nature, the estimate may be accurate enough to make this correct for them as well.
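
The thresholding rule above could be sketched in plain Python roughly as follows. The function names `softmax` and `classify_with_reject` and the default threshold of 0.5 (the 50% figure mentioned above) are illustrative assumptions, not part of CNTK or any other toolkit.

```python
import math

def softmax(logits):
    """Turn raw network outputs into posterior probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_reject(logits, threshold=0.5):
    """Return (class_index, posterior), or (None, posterior) when the
    top posterior is below the threshold ("none of the above")."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return best, probs[best]

# One clearly dominant class is accepted; a flat output is rejected.
print(classify_with_reject([5.0, 0.0, 0.0]))
print(classify_with_reject([0.1, 0.0, 0.05]))
```

As the question points out, a network can still assign a high posterior to an unknown input, so this rule only rejects the cases where the model itself is unsure.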

If "none of the above" inputs are of similar nature but your number of classes is very small (e.g. 10 digits), or if the inputs are of a totally different nature (e.g. the sound of a door slam or someone coughing), thresholding typically fails. Then, one would train a "garbage model." In our experience, it is OK to include the training data for the correct classes. Now the none-of-the-above class may match a correct class as well. But that's OK as long as the none-of-the-above class is not overtrained: its distribution will be much flatter, and thus even if it matches a known class, it will match it with a lower score and thus not win against the actual known class's softmax output.
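
One way the training labels for such a garbage class might be assembled is sketched below. Everything here is an assumption for illustration: the helper name `build_training_set`, the choice of giving the garbage class the index right after the known classes, and the `known_fraction` knob that controls how many known-class examples are also presented under the garbage label. Whether that last step carries over from speech to a softmax image classifier is something to verify experimentally.

```python
import random

def build_training_set(known, outliers, num_known_classes,
                       known_fraction=0.1, seed=0):
    """known: list of (example, label), labels in [0, num_known_classes).
    outliers: list of examples that belong to no known class.
    Returns a shuffled list of (example, label) in which the label
    num_known_classes stands for "none of the above"."""
    rng = random.Random(seed)
    garbage = num_known_classes
    data = list(known)
    # Real out-of-class examples get the garbage label.
    data += [(x, garbage) for x in outliers]
    # Hypothetical knob: also show a sample of known-class examples
    # under the garbage label, keeping it small so the garbage class
    # stays flat rather than overtrained.
    k = int(len(known) * known_fraction)
    data += [(x, garbage) for x, _ in rng.sample(known, k)]
    rng.shuffle(data)
    return data

known = [("img%d" % i, i % 48) for i in range(100)]
outliers = ["door_slam", "cough"]
train = build_training_set(known, outliers, num_known_classes=48)
print(len(train), sum(1 for _, y in train if y == 48))
```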

In the end, I would use both. Definitely use a threshold (to catch the cases that the system can rule out), and also use a garbage model, which I would just train on whatever you have. I would expect that including the correct examples in training does no harm, even if they are the only data you have (please check the paper Anton posted for whether that applies to images as well). It may also make sense to try to synthesize data, e.g. by randomly combining patches from different images.
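
The patch-combining idea could look something like the sketch below. It assumes all images are the same size and are stored as nested lists of pixel values; the function name `synthesize_patchwork` and the `patch` size are made up for illustration, and each tile is copied from the same position in a randomly chosen source image, which is only one of several ways to "combine patches".

```python
import random

def synthesize_patchwork(images, patch=8, seed=None):
    """Build one synthetic "none of the above" image by filling each
    patch-sized tile from a randomly chosen source image."""
    rng = random.Random(seed)
    height, width = len(images[0]), len(images[0][0])
    out = [[None] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            src = rng.choice(images)  # source image for this tile
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    out[y][x] = src[y][x]
    return out

# Three toy 4x4 "images", each filled with a single value.
images = [[[i] * 4 for _ in range(4)] for i in range(3)]
for row in synthesize_patchwork(images, patch=2, seed=1):
    print(row)
```

Images produced this way would be labelled as the garbage class; whether they resemble the unknown inputs seen in practice has to be checked on real data.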

Oleomargarine answered 24/4, 2017 at 16:0 Comment(1)

Frank, do you suggest thresholding with an exact number based on the results from the already seen data? I have a classifier with 5-9 classes that takes a string and, with a scoring function, produces a number (from 0 to 100) for each of the classes, and I pick the maximum. Ideally that maximum is significantly higher than the scores of the other classes. I want to add the none-of-the-above class, but in deciding what threshold to use I am not sure whether I should base it only on the current 5 values (and whether one of them is significantly higher) or on all the observed data from the training set. – Schmaltz 14/7, 2017 at 17:31

I agree with you that this is a key question, but I am not aware of much work in that area either.

There's one recent paper by Zhang and LeCun that addresses the question for image classification in particular. They use large quantities of unlabelled data to create an additional "none of the above" class. The catch, though, is that in some cases their unlabelled data is not completely unlabelled, and they have means of removing "unlabelled" images that actually belong to one of their labelled classes. Having said that, the authors report that, apart from solving the "none of the above" problem, they even see performance gains on their test sets.

As for fitting something post-hoc, just by looking at the outputs of the softmax, I can't provide any pointers.

Cineaste answered 24/4, 2017 at 11:32 Comment(0)
