Parallel generation of random forests using scikit-learn

Main question: How do I combine different random forests in Python with scikit-learn?

I am currently using the randomForest package in R to generate random forest objects via Elastic MapReduce, to address a classification problem.

Since my input data is too large to fit in memory on one machine, I sample it into smaller data sets and generate a random forest object from each, containing a smaller set of trees. I then combine the different trees using a modified combine function to create a new random forest object. This random forest object contains the feature importances and the final set of trees. It does not include the OOB errors or the votes of the trees.

While this works well in R, I want to do the same thing in Python using scikit-learn. I can create the different random forest objects, but I don't have any way to combine them into a new object. Can anyone point me to a function that can combine the forests? Is this possible using scikit-learn?

Here is a link to a question about this process in R: Combining random forests built with different training sets in R.

Edit: The resulting random forest object should contain the trees, which can be used for prediction, and also the feature importances.

Any help would be appreciated.

Boomerang answered 18/9, 2014 at 13:39 Comment(4)
If the goal is prediction, then there is no need to combine the different models. You can make predictions with the separate models and then combine only the results. – Personate
Agree with @DrDom, there are many ways to ensemble models. Details on how you want to do it are pretty important. – Stretcher
@Personate I agree that if it were just predictions, I could combine the results. But I am interested in not only the predictions but also the variable importance of the features. – Boomerang
@reddy, variable importance is the average change in prediction error when the variable is shuffled. Thus the average importance across separate models should be approximately equal to the variable importance of the ensemble of random forests. This holds as long as the importance values were not previously scaled or otherwise modified. In any case, variable importance is not a fixed number, since its value depends on the random numbers. Update: if the number of trees differs between models, you need to take this into account when computing the average importance (see the sketch after these comments). – Personate
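To make the weighted averaging concrete, here is a minimal sketch of what the comment above describes (the helper name is hypothetical; it assumes each forest is already fitted and that the importances were not rescaled):

import numpy as np

def averaged_importances(forests):
    # Weight each forest's importances by its number of trees, as suggested above.
    weights = [len(f.estimators_) for f in forests]
    importances = [f.feature_importances_ for f in forests]
    return np.average(importances, axis=0, weights=weights)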

Sure, just aggregate all the trees. For instance, have a look at this snippet from pyrallel:

from copy import copy

def combine(all_ensembles):
    """Combine the sub-estimators of a group of ensembles

        >>> from sklearn.datasets import load_iris
        >>> from sklearn.ensemble import ExtraTreesClassifier
        >>> iris = load_iris()
        >>> X, y = iris.data, iris.target

        >>> all_ensembles = [ExtraTreesClassifier(n_estimators=4).fit(X, y)
        ...                  for i in range(3)]
        >>> big = combine(all_ensembles)
        >>> len(big.estimators_)
        12
        >>> big.n_estimators
        12
        >>> big.score(X, y)
        1.0

    """
    # Shallow-copy the first ensemble to inherit its fitted parameters,
    # then replace its trees with the union of all the sub-forests' trees.
    final_ensemble = copy(all_ensembles[0])
    final_ensemble.estimators_ = []

    for ensemble in all_ensembles:
        final_ensemble.estimators_ += ensemble.estimators_

    # Required in old versions of sklearn
    final_ensemble.n_estimators = len(final_ensemble.estimators_)

    return final_ensemble
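Applied to the question's setting, usage might look like this (a sketch only; the synthetic data and chunking scheme are illustrative, and combine is the function defined above):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

# Fit one small forest per data chunk, then merge them into one big forest.
chunks = np.array_split(np.arange(len(X)), 3)
forests = [RandomForestClassifier(n_estimators=10, random_state=i).fit(X[idx], y[idx])
           for i, idx in enumerate(chunks)]

big = combine(forests)
print(len(big.estimators_))      # 30 trees in the merged forest
print(big.feature_importances_)  # recomputed from the merged trees on access

Note that feature_importances_ is a property, so the merged forest recomputes it from all of the combined trees when you access it.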
Pr answered 19/9, 2014 at 3:29 Comment(3)
Thanks for the info, I will try this out. By the way, will final_ensemble be a random forest object (in this case), and how will the feature_importances_ attribute be handled in this process? – Boomerang
This is a Python property: it will automatically be recomputed upon access: github.com/scikit-learn/scikit-learn/blob/master/sklearn/… – Pr
Thanks, but this method does not work when one of the random forests has been trained on more labels than the others (a guard for that case is sketched below). – Veteran
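A cheap guard against that failure mode, assuming the combine function above (the helper below is hypothetical): verify that every sub-forest was fitted on the same class set before merging, since the merged forest predicts through the first forest's classes_ array.

import numpy as np

def assert_same_classes(forests):
    # combine() keeps only the first forest's classes_, so every sub-forest
    # must have been trained on an identical set of labels.
    first = forests[0].classes_
    for f in forests[1:]:
        if not np.array_equal(f.classes_, first):
            raise ValueError("all forests must be trained on the same labels")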

Based on your edit, it sounds like you're only asking how to extract the feature importances and look at the individual trees of a random forest. If so, both are attributes of your random forest model, named "feature_importances_" and "estimators_" respectively. An example illustrating this can be found below:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
>>> clf = RandomForestClassifier(n_estimators=5, max_depth=None, min_samples_split=1, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=1,
            n_estimators=5, n_jobs=1, oob_score=False, random_state=0,
            verbose=0)
>>> clf.feature_importances_
array([ 0.09396245,  0.07052027,  0.09951226,  0.09095071,  0.08926362,
        0.112209  ,  0.09137607,  0.11771107,  0.11297425,  0.1215203 ])
>>> clf.estimators_
[DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b408>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b3f0>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b420>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b438>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b450>,
            splitter='best')]
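If you then want to see how those individual trees contribute to a prediction, you can average their class probabilities by hand; the sketch below (reusing clf and X from the example above) mirrors the soft voting that the forest's own predict performs in current scikit-learn versions:

import numpy as np

# Average the per-tree class probabilities across all trees in the forest,
# then pick the most probable class for each sample.
probas = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
preds = clf.classes_[np.argmax(probas, axis=1)]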
Stretcher answered 18/9, 2014 at 20:36 Comment(0)
