How are feature_importances in RandomForestClassifier determined?
Asked Answered
S

7

149

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, which attributes/dates contribute to the result to what extent. Therefore I am just using the feature_importances_, which works well for me.

However, I would like to know how they are getting calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.

Somersomers answered 4/4, 2013 at 11:53 Comment(1)
Woah three core devs on in one SO thread. That's gotta be some kind of record ^^Jeffryjeffy
S
181

There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.

In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.

In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.

(Note that both algorithms are available in the randomForest R package.)

[1]: Breiman, Friedman, "Classification and regression trees", 1984.

Shamekashameless answered 4/4, 2013 at 21:16 Comment(9)
It could be great if this answer was mentioned in the documentation of the importance attributes/example. Been searching for it for awhile too :)Riesman
It seems the importance score is in relative value? For example, the sum of the importance scores of all features is always 1 (see the example here scikit-learn.org/stable/auto_examples/ensemble/…)Doxology
@RNA: Yes, by default variable importances are normalized in scikit-learn, such that they sum to one. You can circumvent this by looping over the individual base estimators and calling tree_.compute_feature_importances(normalize=False).Shamekashameless
@GillesLouppe Do you use the out of bag samples to measure the reduction in MSE for a forest of decision tree regressors in each tree? Or all training data used on the tree?Kania
Two useful resources. (1) blog.datadive.net/… a blog by Ando Saabas implements both "mean decrease impurity" and also "mean decrease in accuracy" as mentioned by Gilles. (2) Download and read Gilles Louppe's thesis.Contagion
"often cited, but unfortunately rarely read": it's not open access anywhere on the internetFevre
FYI, there is a pretty good document in version 0.22 of sklearn on Permutation Importance vs Random Forest Feature Importance (MDI)Sugary
"reaching that node". which node?Faris
@Cokes, based on scikit-learn.org/stable/auto_examples/inspection/… Aug. 2023 the reduction in metric is measured using training statistics: "impurity-based importances are computed on training set statistics and therefore do not reflect the ability of feature to be useful to make predictions that generalize to the test set". Htis is a major problem in my eyes.Conquistador
B
58

The usual way to compute the feature importance values of a single tree is as follows:

  1. you initialize an array feature_importances of all zeros with size n_features.

  2. you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy, MSE, ...). Its the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.

Its important that these values are relative to a specific dataset (both error reduction and the number of samples are dataset specific) thus these values cannot be compared between different datasets.

As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Behka answered 4/4, 2013 at 19:32 Comment(0)
I
16

It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.

Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.

Edit: this description is only partially correct: Gilles and Peter's answers are the correct answer.

Ilona answered 4/4, 2013 at 12:22 Comment(2)
Do you know if there is some paper/documentation about the exact method? eg. Breiman, 2001. It would be great if I had some proper document, which I could cite for the methodology.Somersomers
@Ilona it would be great if you could clearly mark your response as the explanation for the "weighting". The weighting alone does not determine the feature importance. The "impurity metric" ("gini-importance" or RSS) combined with the weights, averaged over trees determines the overall feature importance. Unfortunately the documentation on scikit-learn here: scikit-learn.org/stable/modules/… is not accurate and incorrectly mentions "depth" as the impurity metric.Torietorii
K
14

As @GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting, where you randomly permute the values for each of your features one-by-one and see how much worse your out-of-bag performance is.

Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.

If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:

https://github.com/pjh2011/rf_perm_feat_import

Edit: This works for Python 2.7, not 3

Kaiserdom answered 6/1, 2017 at 22:25 Comment(2)
Hi @Kaiserdom when I use your code I get this error: NameError: name 'xrange' is not defined.Gallagher
Hi @Aizzaac. Sorry I'm new to writing packages, so I should've noted I wrote it for Python 2.7. Try def xrange(x): return iter(range(x)) before running itKaiserdom
K
4

code:

iris = datasets.load_iris()  
X = iris.data  
y = iris.target  
clf = DecisionTreeClassifier()  
clf.fit(X, y)  

decision_tree plot:
enter image description here
We get

compute_feature_importance:[0. ,0.01333333,0.06405596,0.92261071]   

Check source code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances

Try calculate the feature importance:

print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))

We get feature_importance: np.array([0,1.332,6.418,92.30]).

After normalized, we get array ([0., 0.01331334, 0.06414793, 0.92253873]),this is same as clf.feature_importances_.

Be careful all classes are supposed to have weight one.

Kktp answered 6/8, 2018 at 7:59 Comment(0)
C
3

For those looking for a reference to the scikit-learn's documentation on this topic or a reference to the answer by @GillesLouppe:

In RandomForestClassifier, estimators_ attribute is a list of DecisionTreeClassifier (as mentioned in the documentation). In order to compute the feature_importances_ for the RandomForestClassifier, in scikit-learn's source code, it averages over all estimator's (all DecisionTreeClassifer's) feature_importances_ attributes in the ensemble.

In DecisionTreeClassifer's documentation, it is mentioned that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."

Here is a direct link for more info on variable and Gini importance, as provided by scikit-learn's reference below.

[1] L. Breiman, and A. Cutler, “Random Forests”, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Costa answered 13/8, 2018 at 19:40 Comment(0)
I
-1

Feature Importance in Random Forest

  • Random forest uses many trees, and thus, the variance is reduced
  • Random forest allows far more exploration of feature combinations as well
  • Decision trees gives Variable Importance and it is more if there is reduction in impurity (reduction in Gini impurity)
  • Each tree has a different Order of Importance

    Here is what happens in the background!
  • We take an attribute and check in all the trees where it is present and take the average values of the change in the homogeneity on this attribute split. This average value of change in the homogeneity gives us the feature importance of the attribute enter image description here
Ilowell answered 29/1, 2022 at 16:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.