AUC-base Features Importance using Random Forest

Asked 8/7, 2015 at 9:45 Answered 8/7, 2015 at 17:19

Solved python machine-learning scikit-learn scoring

I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).

The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).

The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.

My question is : is that kind of method implemented in scikit-learn (like it is in the R package party) ? Or maybe a workaround ?

PS : This question is kind of linked with an other.

Selfcongratulation answered 8/7, 2015 at 9:45 Comment(0)

After doing some researchs, this is what I came out with :

from sklearn.cross_validation import ShuffleSplit
from collections import defaultdict

names = db_train.iloc[:,1:].columns.tolist()

# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                 class_weight="auto",
                                 criterion='gini',
                                 bootstrap=True,
                                 max_features=10,
                                 min_samples_split=1,
                                 min_samples_leaf=6,
                                 max_depth=3,
                                 n_jobs=-1)
scores = defaultdict(list)

# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
acc = roc_auc_score(Y_test, rf.predict(X_test))

for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])
    shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
    scores[names[i]].append((acc-shuff_acc)/acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))

Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]

The output is not very sexy, but you got the idea. The weakness of this approach is that feature importance seems to be very parameters dependent. I ran it using differents params (max_depth, max_features..) and I'm getting a lot different results. So I decided to run a gridsearch on parameters (scoring = 'roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.

I took my inspiration from this (great) notebook.

All suggestions/comments are most welcome !

Selfcongratulation answered 8/7, 2015 at 17:19 Comment(2)

Need to import roc_auc_score: from sklearn.metrics import roc_auc_score – Kantar 21/10, 2019 at 14:5

This is standard permutation importance - available as a function e.g. in the eli5 package. – Unrig 31/10, 2022 at 10:20

scoring is just a performance evaluation tool used in test sample, and it does not enter into the internal DecisionTreeClassifier algo at each split node. You can only specify the criterion (kind of internal loss function at each split node) to be either gini or information entropy for the tree algo.

scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use a GridSearchCV to tune some of your hyperparameters using the scoring function roc_auc.

Credo answered 8/7, 2015 at 10:44 Comment(1)

I edited my initial question which was confused (because I was myself). Hope it's clearer now!. – Selfcongratulation 8/7, 2015 at 14:37