Is there a way to perform grid search hyper-parameter optimization on One-Class SVM

Is there a way to use GridSearchCV or any other built-in sklearn function to find the best hyper-parameters for OneClassSVM classifier?

What I currently do is perform the search myself using a train/test split, like this:

Gamma and nu values are defined as:

import numpy as np

gammas = np.logspace(-9, 3, 13)
nus = np.linspace(0.01, 0.99, 99)

Code which explores all possible hyper-parameter combinations and finds the best ones:

from sklearn.svm import OneClassSVM
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

clf = OneClassSVM()

results = []

# Vectorize the training data (positive class only) and the mixed test data
train_x = vectorizer.fit_transform(train_contents)
test_x = vectorizer.transform(test_contents)

# y_true (not shown) holds the ground-truth labels for the test documents
for gamma in gammas:
    for nu in nus:
        clf.set_params(gamma=gamma, nu=nu)

        # One-class SVM is fitted on positive-class examples only
        clf.fit(train_x)

        y_pred = clf.predict(test_x)

        if 1. in y_pred:  # Check if at least 1 review is predicted to be in the class
            results.append(((gamma, nu), (accuracy_score(y_true, y_pred),
                                          precision_score(y_true, y_pred),
                                          recall_score(y_true, y_pred),
                                          f1_score(y_true, y_pred),
                                          roc_auc_score(y_true, y_pred))))

# Determine and print the best parameter settings and their performance
print_best_parameters(results, best_parameters(results))

Results are stored in a list of tuples of the form:

((gamma, nu), (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score))

To find the best accuracy, f1 and roc_auc scores and the corresponding parameters, I wrote my own function:

best_parameters(results)
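
A minimal sketch of such a helper, assuming the ((gamma, nu), scores) layout above and selection by a single metric (the actual implementation may differ):

def best_parameters(results, metric_index=3):
    # Indices into the score tuple: 0=accuracy, 1=precision, 2=recall, 3=f1, 4=roc_auc
    return max(results, key=lambda entry: entry[1][metric_index])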

Joachima answered 22/6, 2017 at 12:3 Comment(4)
Have you tried it with GridSearchCV? Are you getting any errors?Ubiquitarian
How do I do that without applying cross-validation? One-Class SVM only needs to be fitted on the data which belongs to the class the classifier is working on. What I do is: train on 80% of the instances which belong to the class, then combine the remaining 20% with instances that don't belong to the class and use those for testing.Joachima
How are you dividing the data into train and test?Ubiquitarian
@Joachima could you please share how you solved this issue with OC-SVM? I am struggling with the same problem and I'm not sure how to combine your question with the answer to get it to work.Commitment

I ran into this same problem and found this question while searching for a solution. I ended up finding a solution that uses GridSearchCV and am leaving this answer for anyone else who searches and finds this question.

The cv parameter of the GridSearchCV class can take as its input an iterable yielding (train, test) splits as arrays of indices. You can generate splits that use only data from the positive class in the training folds, and the remaining data in the positive class plus all data in the negative class in the testing folds.
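
For example, a plain list of hand-written (train, test) index arrays is already a valid cv argument (the indices below are purely illustrative):

import numpy as np

# Two illustrative folds given explicitly as (train, test) index arrays
custom_cv = [
    (np.array([0, 1, 2]), np.array([3, 4, 5])),
    (np.array([3, 4, 5]), np.array([0, 1, 2])),
]
# GridSearchCV(..., cv=custom_cv) fits on rows 0-2 and scores on rows 3-5
# in the first fold, then the other way around in the second fold.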

You can use sklearn.model_selection.KFold to make the splits:

from sklearn.model_selection import KFold

Suppose Xpos is an n × p numpy array of data for the positive class for the OneClassSVM and Xneg is an m × p array of data for known anomalous examples.

You can first generate splits for Xpos using

splits = KFold(n_splits=5).split(Xpos)

This will construct a generator of tuples of the form (train, test) where train is a numpy array of int containing indices for the examples in a training fold and test is a numpy array containing indices for examples in a test fold.

You can then combine Xpos and Xneg into a single dataset using

X = np.concatenate([Xpos, Xneg], axis=0)

The OneClassSVM will make prediction 1.0 for examples it thinks are in the positive class and prediction -1.0 for examples it thinks are anomalous. We can make labels for our data using

y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])

We can then make a new generator of (train, test) splits with indices for the anomalous examples included in the test folds.

n, m = len(Xpos), len(Xneg)

splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in splits)

You can then pass these splits to GridSearchCV via its cv parameter, along with whatever scoring method and other parameters you wish, and then fit it on the data X, y.

grid_search = GridSearchCV(estimator, param_grid, cv=splits, scoring=...)
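
A usage sketch, reusing the X, y and splits built above (param_grid and the f1 scorer here are illustrative choices; note that the data goes to fit, not to the GridSearchCV constructor):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import OneClassSVM

param_grid = {'gamma': np.logspace(-9, 3, 13),
              'nu': np.linspace(0.01, 0.99, 99)}

grid_search = GridSearchCV(OneClassSVM(), param_grid, cv=splits, scoring='f1')
grid_search.fit(X, y)            # X, y go to fit(), not to the constructor
print(grid_search.best_params_)  # best (gamma, nu) according to the scorer
print(grid_search.best_score_)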

Edit: I hadn’t noticed that this approach was suggested in the comments of the other answer by Vivek Kumar, and that the OP had rejected it because they didn’t believe it would work with their method of choosing the best parameters. I still prefer the approach I’ve described because GridSearchCV will automatically handle multiprocessing and provides exception handling and informative warning and error messages.

It is also flexible in the choice of scoring method. You can use multiple scoring methods by passing a dictionary mapping strings to scoring callables and even define custom scoring callables. This is described in the Scikit-learn documentation here. A bespoke method of choosing the best parameters could likely be implemented with a custom scoring function. All of the metrics used by the OP could be included using the dictionary approach described in the documentation.
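
A sketch of that multi-metric idea, reusing the names from the sketch above; a list of built-in scorer names works as well as a dict of callables, and refit decides which metric picks best_params_:

scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

grid_search = GridSearchCV(OneClassSVM(), param_grid, cv=splits,
                           scoring=scoring, refit='f1')
grid_search.fit(X, y)
print(grid_search.best_params_)                      # chosen by the refit metric
print(grid_search.cv_results_['mean_test_roc_auc'])  # every metric is reported per candidate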

You can find a real world example here. I'll make a note to change the link when this gets merged into master.

Codfish answered 18/10, 2019 at 23:44 Comment(3)
@asj3 Does the line splits = ((train, np.concatenate([test, np.arange(n, n + m), axis=0) for train, test in splits) have a syntax error?Geognosy
@Geognosy Good catch. There was a missing bracket. Fixed now.Codfish
@asj3 I also don't think you can pass X and y to GridSearchCV. They are passed later to grid_search.fit(X, y).Geognosy

Yes, there is a way to search over hyper-parameters without performing cross-validation on the input data. The method is called ParameterGrid() and lives in sklearn.model_selection. Here is the link to the official documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html

Your case might look like the following:

grid = {'gamma' : np.logspace(-9, 3, 13),
        'nu' : np.linspace(0.01, 0.99, 99)}

To list all the parameter combinations in the grid you can type list(ParameterGrid(grid)). You can also check its length via len(list(ParameterGrid(grid))), which gives 1287 combinations in total and thus 1287 models to fit on the training data.
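
For example (the exact ordering and float formatting are implementation details):

from sklearn.model_selection import ParameterGrid

combos = list(ParameterGrid(grid))
print(len(combos))  # 13 gamma values * 99 nu values = 1287 combinations
print(combos[0])    # a single combination, e.g. {'gamma': 1e-09, 'nu': 0.01}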

To use this method you need a for loop. Assuming clf is an unfitted one-class SVM imported from sklearn.svm, the loop will look something like this:

for z in ParameterGrid(grid):
    clf.set_params(**z)   # apply one (gamma, nu) combination
    clf.fit(X_train)      # OneClassSVM ignores labels during fit
    clf.predict(X_test)   # score the predictions however you like
    ...

I hope that suffices. Do not forget that the names in grid must match the parameter names of the one-class SVM. To get these names you can call clf.get_params().keys(), where you will see 'gamma' and 'nu' among them.
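
If the loop should also select the best parameters itself, a minimal sketch, assuming a held-out labelled test set (X_test, y_test with labels 1/-1) and f1 as the selection metric, could look like this:

from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid
from sklearn.svm import OneClassSVM

clf = OneClassSVM()
best_score, best_params = -1.0, None

for z in ParameterGrid(grid):
    clf.set_params(**z)
    clf.fit(X_train)                               # fit on positive-class data only
    score = f1_score(y_test, clf.predict(X_test))  # evaluate on the mixed test set
    if score > best_score:
        best_score, best_params = score, z

print(best_params, best_score)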

Ultracentrifuge answered 23/6, 2017 at 5:45 Comment(5)
This solution is good. But then again, the OP has to maintain all the information about the scores, fits, parameters etc. GridSearchCV will do that automatically. And since the user is dividing the data into train and test, then we can use a custom cv iterator which will split the data accordingly.Ubiquitarian
It is slightly confusing for me, too. I would do the same thing as you have pointed out. I am not sure, though, if this for loop is more time-consuming than basic GridSearchCV, or if they are almost equal.Ultracentrifuge
I can't say for sure about this for loop, but GridSearchCV will parallelize the internal fitting of different parameters, so maybe that will have slightly higher performance than this.Ubiquitarian
Oh, yeah. For sure it would be faster.Ultracentrifuge
It reduces nesting by one indentation. Performance seems to be about the same. However, that is not very useful as I still have to use my own implementation for finding the best hyperparameters.Joachima
