Proximity Matrix - Random Forest, R

3

10

I am using the randomForest package in R, which can calculate a proximity matrix (P). The package documentation describes the returned value as: "if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes)."

I obtain the proximity matrix of a random forest as follows:

P <- randomForest(x, y, ntree = 1000, proximity=TRUE)$proximity

When I inspect the P matrix, I see values like P(i,j)=0.971014493, where i and j are two data instances in my training data set (x). Such a value does not make sense to me: if proximity is a "frequency" over the trees, then multiplying it by 1000 (the number of trees in the forest) should give an integer, and it does not. Could someone please help me understand why I get such real numbers in the proximity matrix?

Fundamentalism answered 20/5, 2014 at 14:0 Comment(0)
13

Because, just as with the default predictions, the default proximity is calculated using only the trees where neither observation was included in the sample used to build that tree (that is, both were "out-of-bag").

The number of times this happens will vary slightly for each pair of cases, and certainly won't be a nice round number like 1000.
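As a quick arithmetic check on the value from the question, the sketch below searches for the smallest denominator that turns 0.971014493 into a whole-number count. The search range of 1 to 1000 and the variable names are illustrative assumptions; the only thing taken from the question is the value itself.

```r
# Value reported in the question
p <- 0.971014493

# Candidate denominators: the number of trees in which both cases were out-of-bag
# (assumed to lie somewhere between 1 and ntree = 1000)
d <- 1:1000

# Distance of p * d from the nearest whole number
err <- abs(p * d - round(p * d))

best_d <- d[which.min(err)]
best_d              # 69
round(p * best_d)   # 67
```

So the value is consistent with 67/69, or equivalently with an unreduced ratio such as 134/138. That fits the rough expectation that a given pair of cases is out-of-bag together in about 0.368^2, or roughly 13.5%, of 1000 trees, although the exact count for that pair cannot be recovered from the proximity alone.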

You'll note that the very next parameter listed after proximity is oob.prox, which controls whether the proximity uses only out-of-bag pairs (the default when proximity=TRUE) or every tree.
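A minimal sketch of the difference, using the built-in iris data as a stand-in for the asker's x and y (the data set, seed, and helper name is_whole are assumptions made for illustration):

```r
library(randomForest)

set.seed(42)
x <- iris[, 1:4]
y <- iris$Species
ntree <- 1000

# Default behaviour: oob.prox follows proximity, so each pair is scored
# only on the trees where both cases were out-of-bag
P_oob <- randomForest(x, y, ntree = ntree, proximity = TRUE)$proximity

# Use every tree for every pair of cases
P_all <- randomForest(x, y, ntree = ntree, proximity = TRUE,
                      oob.prox = FALSE)$proximity

# Hypothetical helper: are all entries whole numbers (up to rounding error)?
is_whole <- function(m) all(abs(m - round(m)) < 1e-6)

is_whole(P_all * ntree)  # counts out of ntree, so this should be TRUE
is_whole(P_oob * ntree)  # denominators vary by pair, so typically FALSE
```

With oob.prox = FALSE, multiplying by the number of trees gives back the integer counts the question expected.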

Ziagos answered 20/5, 2014 at 15:10 Comment(0)
6

Just to add to the above answer, since this looked weird to me too and in case it helps anyone: according to Breiman (and I quote):

'An intrinsic proximity measure.

Since an individual tree is unpruned, the terminal nodes will contain only a small number of instances. Run all cases in the training set down the tree. If case i and case j both land in the same terminal node, increase the proximity between i and j by one. At the end of the run, the proximities are divided by twice the number of trees in the run and proximity between a case and itself set equal to one.'

The above is from Breiman's paper 'Using Random Forests', which is listed as a reference for the randomForest function.
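The counting procedure in the quote can be sketched by hand from the terminal-node assignments that predict() returns when nodes = TRUE. This is an illustration, not the package's internal code: it uses iris as stand-in data, and it divides by the number of trees rather than twice that number, which is an assumption chosen to match what the package appears to return with oob.prox = FALSE.

```r
library(randomForest)

set.seed(1)
x <- iris[, 1:4]
y <- iris$Species

# Proximity over all trees, as in "run all cases in the training set down the tree"
rf <- randomForest(x, y, ntree = 200, proximity = TRUE, oob.prox = FALSE)

# n x ntree matrix: the terminal node each case lands in, for each tree
nodes <- attr(predict(rf, x, nodes = TRUE), "nodes")

n <- nrow(nodes)
counts <- matrix(0, n, n)
for (t in seq_len(ncol(nodes))) {
  # Add one for every pair of cases sharing a terminal node in tree t
  counts <- counts + outer(nodes[, t], nodes[, t], "==")
}

# Normalise by the number of trees (assumption, see the note above)
P_manual <- counts / ncol(nodes)

# Compare with the matrix returned by the package
all.equal(P_manual, rf$proximity, check.attributes = FALSE)
```

If the two matrices agree, the package's all-trees proximity is simply the share of trees in which two cases share a leaf; the default out-of-bag version changes only which trees are counted for each pair.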

Earvin answered 28/10, 2014 at 15:15 Comment(2)
He says "Run all cases in the training set down the tree." I thought this was supposed to be only out-of-bag cases. - Criticize
@Criticize Breiman may be using a different approach than the one in R's randomForest. - Monosome
4

Proximity is the proportion of trees in which two data points end up in the same leaf node.

Oakley answered 19/2, 2015 at 11:45 Comment(0)
