Why is Random Forest with a single tree much better than a Decision Tree classifier?

Asked 13/1, 2018 at 11:4 Answered 18/6, 2024 at 20:36

Solved python machine-learning scikit-learn random-forest decision-tree

I apply the decision tree classifier and the random forest classifier to my data with the following code:

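from sklearn import tree
from sklearn.ensemble import RandomForestClassifier


def decision_tree(train_X, train_Y, test_X, test_Y):
    # Single decision tree with default settings
    clf = tree.DecisionTreeClassifier()
    clf.fit(train_X, train_Y)
    return clf.score(test_X, test_Y)


def random_forest(train_X, train_Y, test_X, test_Y):
    # Random forest restricted to a single tree
    clf = RandomForestClassifier(n_estimators=1)
    # Fit on the training split only, as for the decision tree
    clf.fit(train_X, train_Y)
    return clf.score(test_X, test_Y)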

Why are the results so much better for the random forest classifier (over 100 runs, each randomly sampling 2/3 of the data for training and 1/3 for testing)?

100%|███████████████████████████████████████| 100/100 [00:01<00:00, 73.59it/s]
Algorithm: Decision Tree
  Min     : 0.3883495145631068
  Max     : 0.6476190476190476
  Mean    : 0.4861783113770316
  Median  : 0.48868030937802126
  Stdev   : 0.047158171852401135
  Variance: 0.0022238931724605985
100%|███████████████████████████████████████| 100/100 [00:01<00:00, 85.38it/s]
Algorithm: Random Forest
  Min     : 0.6846846846846847
  Max     : 0.8653846153846154
  Mean    : 0.7894823428836184
  Median  : 0.7906101571063208
  Stdev   : 0.03231671150915106
  Variance: 0.0010443698427656967

Isn't a random forest with one estimator just a decision tree? Have I done something wrong or misunderstood the concept?

Dugong answered 13/1, 2018 at 11:4 Comment(1)

It depends on the parameters you use for the random forest. Random forest is meant to use many trees; it is not efficient. XGBoost works on error correction with many trees. The goal is the strategy for reducing error, not efficiency. – Gaylord 16/2, 2022 at 21:48

Isn't a random forest with one estimator just a decision tree?

Well, this is a good question, and the answer turns out to be no; the Random Forest algorithm is more than a simple bag of individually-grown decision trees.

Apart from the randomness induced by ensembling many trees, the Random Forest (RF) algorithm also incorporates randomness when building individual trees in two distinct ways, neither of which is present in the simple Decision Tree (DT) algorithm.

The first is the number of features to consider when looking for the best split at each tree node: while DT considers all the features, RF considers a random subset of them, of size equal to the parameter max_features (see the docs).
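As a rough illustration of this first point, the sketch below mimics how the candidate features for one node could be drawn. It is not scikit-learn's internal code: the helper name candidate_features, the feature count, and the square-root subset size are assumptions made for the example.

```python
import numpy as np


def candidate_features(n_features, max_features, rng):
    """Hypothetical helper: pick the features examined at one node."""
    if max_features is None:
        # Decision-tree behaviour: every feature is a split candidate.
        return np.arange(n_features)
    # Random-forest behaviour: a random subset, drawn without replacement.
    return np.sort(rng.choice(n_features, size=max_features, replace=False))


rng = np.random.default_rng(0)
n_features = 16
dt_candidates = candidate_features(n_features, None, rng)
rf_candidates = candidate_features(n_features, int(np.sqrt(n_features)), rng)
print("DT-style node sees:", dt_candidates)
print("RF-style node sees:", rf_candidates)
```

In a real forest a fresh subset is drawn at every node, so different nodes of the same tree can end up looking at different features.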

The second is that, while DT considers the whole training set, a single RF tree considers only a bootstrapped sub-sample of it; from the docs again:

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
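To see what this means in practice, here is a small NumPy sketch (my own illustration, not taken from the docs) that draws a bootstrap sample of row indices and measures how many distinct rows it contains. The sample size of 10,000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Draw n_samples row indices with replacement, as a bootstrap sample would.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct original rows that made it into the sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples  # tends to 1 - 1/e

print(f"unique rows in bootstrap sample: {unique_fraction:.3f}")
print(f"theoretical expectation:         {expected_fraction:.3f}")
```

So although the bootstrapped sample has the same size as the training set, a single RF tree sees only roughly 63% of the distinct training rows, with the rest being duplicates.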

The RF algorithm is essentially the combination of two independent ideas: bagging and random selection of features (see the Wikipedia entry for a nice overview). Bagging is essentially my second point above, but applied to an ensemble; random selection of features is my first point above, and it seems to have been proposed independently by Tin Kam Ho before Breiman's RF (again, see the Wikipedia entry). Ho had already suggested that random feature selection alone improves performance. This is not exactly what you have done here (you still use the bootstrap sampling idea from bagging, too), but you could easily replicate Ho's idea by setting bootstrap=False in your RandomForestClassifier() arguments. Given this research, the difference in performance is not unexpected.
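If you want to check how much each ingredient contributes on your own data, something along these lines could be a starting point. It is only a sketch: the synthetic dataset from make_classification stands in for your data, the number of runs is cut to 20, and the variant labels are mine. I make no claim about which variant will come out ahead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the original dataset is not available).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

variants = {
    "bootstrap + feature subsets": dict(bootstrap=True, max_features="sqrt"),
    "feature subsets only (Ho)": dict(bootstrap=False, max_features="sqrt"),
    "bootstrap only": dict(bootstrap=True, max_features=None),
    "neither (plain tree)": dict(bootstrap=False, max_features=None),
}

mean_scores = {}
for name, params in variants.items():
    scores = []
    for run in range(20):
        # 2/3 train, 1/3 test, reshuffled on every run as in the question.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1 / 3, random_state=run)
        clf = RandomForestClassifier(n_estimators=1, random_state=run, **params)
        scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
    mean_scores[name] = float(np.mean(scores))
    print(f"{name:30s} mean accuracy = {mean_scores[name]:.3f}")
```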

To replicate exactly the behaviour of a single tree in RandomForestClassifier(), you should use both bootstrap=False and max_features=None arguments, i.e.

clf = RandomForestClassifier(n_estimators=1, max_features=None, bootstrap=False)

in which case neither bootstrap sampling nor random feature selection will take place, and the performance should be roughly equal to that of a single decision tree.
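A sketch of how this equivalence could be checked with np.array_equal() follows. It relies on my understanding that the forest passes a derived integer seed to its tree (exposed as rf.estimators_[0].random_state), so the standalone tree reuses that seed. The synthetic dataset is an assumption, and the comparison result is printed rather than promised.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any classification dataset would do).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)

# Single-tree forest with both sources of randomness switched off.
rf = RandomForestClassifier(n_estimators=1, max_features=None,
                            bootstrap=False, random_state=0)
rf.fit(X_tr, y_tr)

# The forest hands its tree a derived seed, so reuse that seed for the
# standalone tree instead of passing random_state=0 again.
tree_seed = rf.estimators_[0].random_state
dt = DecisionTreeClassifier(random_state=tree_seed)
dt.fit(X_tr, y_tr)

rf_proba = rf.predict_proba(X_te)
dt_proba = dt.predict_proba(X_te)
same = np.array_equal(rf_proba, dt_proba)
print("tree seed used by the forest:", tree_seed)
print("identical probabilities:", same)
```

Without matching the seed, the two models can still differ slightly, because even with max_features=None the tree builder shuffles the features and may break ties between equally good splits differently.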

Crucial answered 13/1, 2018 at 11:59 Comment(1)

Could you provide an example using np.array_equal() to compare probabilities from RF and DT? I have been trying to reproduce what you are saying and comparing it, but I don't find a True statement using np.array_equal(). I made a question related to his comment. – Phytology 7/4, 2022 at 18:14

That's a great observation! Although it seems difficult to achieve under the conditions desertnaut described, this improvement can be achieved by reformulating the task using a Pairwise Difference Classifier.

The PDL classifier works by pairing data points and predicting their similarities. This approach simplifies the training process while maintaining high performance.

In fact, with a single tree, the PDL Classifier (PDC) can outperform a Random Forest, as demonstrated in an experiment with around 100 datasets.

[Figure: bar plot comparing the performance of PDC and Random Forest]

By using PDL, you get the benefits of a simpler, more interpretable model with superior performance.

Jacksnipe answered 18/6, 2024 at 20:36 Comment(0)
