How should I train a machine learning algorithm on data with a large class imbalance? (SVM)
I am trying to train my SVM on data describing clicks and conversions by people who see banners. The main problem is that clicks make up only about 0.2% of all the data, so there is a large class imbalance. When I use a plain SVM, in the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% correct answers (because of the imbalance), but it gives 0% correct predictions if you check the "click" or "conversion" cases. How can I tune the SVM algorithm (or select another one) to take the imbalance into account?

Melchor answered 6/8, 2013 at 10:49
Is up-sampling the minority class an option? – Dunford
Could you say more about what you mean by up-sampling? – Melchor
Possible duplicate of sklearn logistic regression with unbalanced classes – Taskmaster

The most basic approach here is to use a so-called "class weighting scheme". In the classical SVM formulation there is a C parameter that controls the misclassification penalty. It can be replaced by parameters C1 and C2 used for classes 1 and 2 respectively. The most common choice of C1 and C2 for a given C is to put

C1 = C / n1
C2 = C / n2

where n1 and n2 are the sizes of class 1 and class 2 respectively. This way you "punish" the SVM much harder for misclassifying the less frequent class than for misclassifying the more common one.

Many existing libraries support this mechanism through class weight parameters (class_weight in scikit-learn, the -wi options in libSVM).
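For instance, the C1 = C/n1 scheme above corresponds directly to per-class weights. A minimal sketch, assuming scikit-learn (which sets the parameter C of class i to class_weight[i] * C); the data here is hypothetical:

import numpy as np
from sklearn import svm

# hypothetical imbalanced labels: 1000 negatives, 100 positives
rng = np.random.RandomState(0)
X = rng.randn(1100, 2)
y = np.array([0] * 1000 + [1] * 100)

# weights proportional to 1 / n_i, i.e. C_i = C / n_i up to a common scale
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / n for c, n in zip(classes, counts)}

clf = svm.SVC(kernel='linear', C=1.0, class_weight=weights)
clf.fit(X, y)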

Example using Python and sklearn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# create two clusters: 1000 majority-class samples and
# 100 minority-class samples centred at (2, 2)
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * n_samples_1 + [1] * n_samples_2

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]

# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot both separating hyperplanes and the samples
plt.plot(xx, yy, 'k-', label='no weights')
plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()

plt.axis('tight')
plt.show()

In particular, in scikit-learn you can simply turn on automatic weighting by setting class_weight='balanced' (named 'auto' in older versions), which weights each class inversely proportional to its frequency.
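For example, reusing X and y from the snippet above (a minimal sketch, assuming a recent scikit-learn version):

wclf = svm.SVC(kernel='linear', class_weight='balanced')
wclf.fit(X, y)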

[Visualization of the above code, from the scikit-learn documentation]

Handtomouth answered 6/8, 2013 at 18:47
Thank you very much, it's exactly what I was looking for. I wish I had 15 points to vote for this answer :) – Melchor
I am pretty sure that you can still check the "accept answer" option :) – Handtomouth

This paper describes a variety of techniques. One simple (but, for SVMs, very bad) method is just replicating the minority class(es) until you have a balance:

http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf
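For completeness, a minimal sketch of that replication idea as random oversampling with replacement, using sklearn.utils.resample on hypothetical data (note the comment below on why this is a poor fit for SVMs):

import numpy as np
from sklearn.utils import resample

# hypothetical imbalanced data
X = np.random.randn(1100, 2)
y = np.array([0] * 1000 + [1] * 100)

X_maj, X_min = X[y == 0], X[y == 1]

# replicate minority samples (with replacement) until the classes match
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                    random_state=0)

X_bal = np.r_[X_maj, X_min_up]
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))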

Lida answered 12/6, 2015 at 21:31
Just for completeness - replicating the minority class should never be used with SVMs. It is equivalent to using class weights, while at the same time being completely inefficient in terms of training (and testing) time. – Handtomouth
I edited my original answer to reflect lejlot's comment. – Lida
