One Class SVM algorithm taking too long

The data below shows part of my dataset, which is used to detect anomalies:

    describe_file   data_numbers    index
0   gkivdotqvj      7309.0          0
1   hpwgzodlky      2731.0          1
2   dgaecubawx      0.0             2
3   NaN             0.0             3
4   lnpeyxsrrc      0.0             4

I used the One-Class SVM algorithm to detect anomalies:

import numpy as np
from scipy import stats
from pyod.models.ocsvm import OCSVM
random_state = np.random.RandomState(42)     
outliers_fraction = 0.05
classifiers = {
        'One Classify SVM (SVM)':OCSVM(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, contamination=outliers_fraction)
}

X = data['data_numbers'].values.reshape(-1,1)   

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1

    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)

    # copy of dataframe
    dfx = data[['index', 'data_numbers']].copy()  # explicit copy avoids SettingWithCopyWarning
    dfx['outlier'] = y_pred.tolist()
    IX1 =  np.array(dfx['data_numbers'][dfx['outlier'] == 0]).reshape(-1,1)
    OX1 =  dfx['data_numbers'][dfx['outlier'] == 1].values.reshape(-1,1)         
    print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)    
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction) 

tOut = stats.scoreatpercentile(dfx[dfx['outlier'] == 1]['data_numbers'], np.abs(threshold))

y = dfx['outlier'].values.reshape(-1,1)
def severity_validation():
    tOUT10 = tOut+(tOut*0.10)    
    tOUT23 = tOut+(tOut*0.23)
    tOUT45 = tOut+(tOut*0.45)
    dfx['test_severity'] = "None"
    for i, row in dfx.iterrows():
        if row['outlier']==1:
            # use .loc instead of chained indexing, which may write to a temporary copy
            if row['data_numbers'] <=tOUT10:
                dfx.loc[i, 'test_severity'] = "Low Severity"
            elif row['data_numbers'] <=tOUT23:
                dfx.loc[i, 'test_severity'] = "Medium Severity"
            elif row['data_numbers'] <=tOUT45:
                dfx.loc[i, 'test_severity'] = "High Severity"
            else:
                dfx.loc[i, 'test_severity'] = "Ultra High Severity"

severity_validation()

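from shutil import rmtree
from tempfile import mkdtemp

import matplotlib.pyplot as plt
import sklearn as skl
import sklearn.metrics  # makes skl.metrics available
from joblib import Memory
from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline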
X_train, X_test, Y_train, Y_test = train_test_split(dfx[['index','data_numbers']], dfx.outlier, test_size=0.25, 
                                                    stratify=dfx.outlier, random_state=30)

#Instantiate Classifier
normer = preprocessing.Normalizer()
svm1 = svm.SVC(probability=True, class_weight={1: 10})

cached = mkdtemp()
memory = Memory(location=cached, verbose=3)  # 'cachedir' was replaced by 'location' in newer joblib
pipe_1 = Pipeline(steps=[('normalization', normer), ('svm', svm1)], memory=memory)

cv = skl.model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

param_grid = [ {"svm__kernel": ["linear"], "svm__C": [0.5]}, {"svm__kernel": ["rbf"], "svm__C": [0.5], "svm__gamma": [5]} ]
grd = GridSearchCV(pipe_1, param_grid, scoring='roc_auc', cv=cv)

#Training
y_pred = grd.fit(X_train, Y_train).predict(X_test)
rmtree(cached)

#Evaluation
confmatrix = skl.metrics.confusion_matrix(Y_test, y_pred)
print(confmatrix)
# grd is already fitted above; refitting here would repeat the whole grid search
Y_pred = grd.predict_proba(X_test)[:,1]
def plot_roc(y_test, y_pred):
    fpr, tpr, thresholds = skl.metrics.roc_curve(y_test, y_pred, pos_label=1)
    roc_auc = skl.metrics.auc(fpr, tpr)
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area ={0:.2f})'.format(roc_auc))
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show();
plot_roc(Y_test, Y_pred) 


My dataset is quite big, running to millions of rows. As a result, I can only run the code on a couple of hundred thousand rows. The code works just fine, but it takes too long, so I am hoping to get some advice on how to optimize it so it runs faster.

Roulade answered 17/3, 2020 at 14:19 Comment(8)
The rbf kernel will run forever on anything larger than several tens of thousands of rows. Change kernel. Change algo. Buy a more powerful machine. - Backstroke
Look at EllipticEnvelope or IsolationForest; they are both pretty fast algos for anomaly/outlier detection. - Backstroke
@Sergey Bushmanov, I will give these two other algorithms a try. Regarding my code, can you tell me what you would change so it runs just a tiny bit faster? - Roulade
I am not familiar with pyod (od for outlier detection?), but sklearn's SVM has kernels other than rbf. I would start with linear, see if that satisfies you, and proceed to more complex kernels. Concerning the algos, I would start by trying to understand what constitutes an outlier for a 1d distribution (it's 1d, right?). If it's normal, calculating σ and seeing what is further than 2-3σ from the mean would be enough. Even an envelope would be overkill here. If it's not normal, I would try to investigate what would be considered an outlier for that type of distribution. - Backstroke
If you insist on One-Class SVM with the rbf kernel for some reason, training on a representative sample of a couple of hundred thousand rows and then predicting outliers is also not bad at all. - Backstroke
@Sergey Bushmanov, if I go with the `linear` kernel, which lines do I have to change? - Roulade
OCSVM(kernel='linear' - Backstroke
A linear kernel on data after the sklearn.kernel_approximation.Nystroem transformation performed much faster, with near-equivalent accuracy, for me on a classification task. I would suggest that. - Sesquialtera
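
The 2-3σ check suggested in the comments above could be sketched roughly as follows. This is only an illustration under assumptions not in the question: the data is synthetic and roughly normal, and the helper name `sigma_outliers` and the cutoff `k` are hypothetical. A few extreme values inflate σ, so a robust estimate (for example median and MAD) may be a better fit for real data.

```python
import numpy as np


def sigma_outliers(values, k=3.0):
    """Flag values more than k standard deviations from the mean.

    NaN entries are ignored when estimating mean/std and are never flagged.
    """
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    std = np.nanstd(values)
    if std == 0:
        # Constant data: nothing can be an outlier under this rule.
        return np.zeros(values.shape, dtype=bool)
    with np.errstate(invalid="ignore"):
        return np.abs(values - mean) > k * std


# Synthetic stand-in for data['data_numbers'] (assumption, not the real dataset).
rng = np.random.default_rng(42)
data_numbers = rng.normal(loc=100.0, scale=10.0, size=1_000_000)
data_numbers[:5] = [7309.0, 2731.0, 0.0, 0.0, 500.0]  # a few injected extremes

mask = sigma_outliers(data_numbers, k=3.0)
print(mask.sum(), "outliers out of", mask.size)
```

This is a single vectorized pass over the column, so it handles millions of rows in well under a second on typical hardware.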

SVM training time scales badly with the number of samples, typically O(n^2) or worse, so it is not suitable for datasets with millions of samples. Some example code for exploring this can be found here.
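
One way to see this scaling on your own machine is to time the fit at a few doubling sample sizes. The sketch below is an assumption-laden illustration, not the code linked above: it uses scikit-learn's `OneClassSVM` (which pyod's `OCSVM` wraps) on synthetic 1-D data, and the helper name `time_ocsvm_fit` is hypothetical.

```python
import time

import numpy as np
from sklearn.svm import OneClassSVM


def time_ocsvm_fit(n_samples, seed=0):
    """Return the wall-clock seconds needed to fit an RBF one-class SVM."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, 1))  # synthetic 1-D feature
    model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.5)
    start = time.perf_counter()
    model.fit(X)
    return time.perf_counter() - start


# Doubling the sample size each step makes super-linear growth easy to spot.
sizes = [1_000, 2_000, 4_000, 8_000]
timings = {n: time_ocsvm_fit(n) for n in sizes}
for n, seconds in timings.items():
    print(f"n={n:>6}: {seconds:.3f} s")
```

If the time roughly quadruples each time n doubles, the fit is behaving quadratically, and extrapolating to millions of rows shows why it never finishes.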

I would recommend trying IsolationForest instead; it is fast and performs well.
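
A minimal sketch of that swap, assuming scikit-learn's `IsolationForest` and synthetic data standing in for `data_numbers` (the injected extreme values are made up for illustration). The labels are remapped to the 0 = inlier, 1 = outlier convention that pyod uses, so the rest of the question's code could stay unchanged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for data['data_numbers'].values.reshape(-1, 1) (assumption).
rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=10.0, size=(100_000, 1))
X[:100] = rng.uniform(2_000, 8_000, size=(100, 1))  # injected extreme values

outliers_fraction = 0.05  # same value as in the question
iso = IsolationForest(
    n_estimators=100,
    contamination=outliers_fraction,
    random_state=42,
)

# scikit-learn returns +1 for inliers and -1 for outliers.
labels = iso.fit_predict(X)
y_pred = (labels == -1).astype(int)  # 0 = inlier, 1 = outlier, as in pyod

# Higher value = more anomalous, mirroring `decision_function(X) * -1` above.
scores_pred = -iso.score_samples(X)

print("OUTLIERS:", int(y_pred.sum()), "INLIERS:", int((y_pred == 0).sum()))
```

Each tree is built on a small subsample (256 rows by default), which is why the training cost grows roughly linearly with the number of rows instead of quadratically.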

If you want to use SVM, subsample your dataset so that you have 10-100k samples. The linear kernel will also be significantly faster to train than RBF, but will still scale poorly with a large number of samples.
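
The subsampling approach might look like the sketch below: fit on a random sample, then label the full dataset in batches. It assumes scikit-learn's `OneClassSVM` rather than pyod's wrapper, synthetic data, and illustrative values for `sample_size`, `nu` and `batch_size`; the helper `predict_in_batches` is hypothetical.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the full dataset (assumption).
rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=10.0, size=(200_000, 1))

# Train on a random subsample only; training cost depends on the sample size.
sample_size = 10_000
sample_idx = rng.choice(len(X), size=sample_size, replace=False)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X[sample_idx])


def predict_in_batches(model, X, batch_size=50_000):
    """Label every row (0 = inlier, 1 = outlier) without one huge kernel matrix."""
    parts = []
    for start in range(0, len(X), batch_size):
        labels = model.predict(X[start:start + batch_size])  # +1 / -1
        parts.append((labels == -1).astype(int))
    return np.concatenate(parts)


y_pred = predict_in_batches(model, X)
print("outlier fraction:", y_pred.mean())
```

Prediction cost grows with the number of support vectors times the number of rows, so a smaller `nu` (fewer support vectors) also keeps the labeling pass cheap.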

Widthwise answered 21/3, 2020 at 15:8 Comment(0)
