Combining random forests built with different training sets in R

I am new to R (day 2) and have been tasked with building a forest of random forests. Each individual random forest will be built using a different training set, and we will combine all the forests at the end to make predictions. I am implementing this in R and am having some difficulty combining two forests that were not built using the same training set. My attempt is as follows:

library(randomForest)

d1 = read.csv("../data/rr/train/10/chunk0.csv",header=TRUE)
d2 = read.csv("../data/rr/train/10/chunk1.csv",header=TRUE)

rf1 = randomForest(A55~., data=d1, ntree=10)
rf2 = randomForest(A55~., data=d2, ntree=10)

rf = combine(rf1,rf2)

This of course produces an error:

Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) : 
non-conformable arrays
In addition: Warning message:
In rf$oob.times + rflist[[i]]$oob.times :
longer object length is not a multiple of shorter object length

I have been browsing the web for some time looking for a clue on this, but haven't had any success yet. Any help here would be most appreciated.

Tapes asked 3/10, 2013 at 22:17 Comment(5)
What are the exact structures of the two data sets? (use str to see). Do they have exactly the same variables, all named the same way?Derringer
@Derringer They do. They are both subsets of a larger training set which I manually split apart. When I run str, the first lines are 'data.frame': 38735 obs. of 55 variables: for d1 and 'data.frame': 38734 obs. of 55 variables: for d2, followed by the same names for each data set.Tapes
Check the dimensions of the votes objects of each rf object. It is possible that not all factor levels were present in each subset of your training data.Derringer
@Derringer So while I don't really understand what that means, there is a difference: > length(rf1$votes) [1] 271145 and > length(rf2$votes) [1] 271138. First, what do you mean by factor levels? And second, I read in the documentation that votes is 'a matrix with one row for each input data point and one column for each class, giving the fraction or number of (OOB) votes from the random forest.' It makes sense this would be imbalanced since the data is imbalanced, but where do these large lengths come from?Tapes
I believe I have a solution below. However, if you're confused by the terminology surrounding factors (a fundamental data type in R) then you're going to be in way over your head really quick trying to do serious stuff with random forests. I would get very familiar with sections 2-6 of this before you get much further.Derringer

Ah. This is either an oversight in combine or what you're trying to do is nonsensical, depending on your point of view.

The votes matrix records the number of votes in the forest for each case in the training data for each response category. Naturally, it will have one row for each row in your training data.

combine is assuming that you ran your random forests twice on the same set of data, so the dimensions of those matrices will be the same. It's doing this because it wants to provide you with some "overall" error estimates for the combined forest.

But if the two data sets are different, combining the votes matrices becomes simply nonsensical. You could get combine to run by removing one row from your larger training data set, but the resulting votes matrix in the combined forest would be gibberish, since each row would be a combination of votes for two different training cases.
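You can see the mismatch directly by comparing the dimensions of the two votes matrices (a quick sketch using the rf1 and rf2 objects from the question; the 7 response classes are inferred from the lengths quoted in the comments, since 38735 * 7 = 271145 and 38734 * 7 = 271138):

dim(rf1$votes)  # 38735 x 7: one row per training case, one column per class
dim(rf2$votes)  # 38734 x 7: one row fewer, hence "non-conformable arrays"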

So maybe this is something that should be an option that can be turned off in combine, because it should still make sense to combine the actual trees and predict with the resulting object. But some of the "combined" error estimates in the output from combine will be meaningless.

Long story short, make each training data set the same size and it will run. But if you do, I wouldn't use the resulting object for anything other than making new predictions. Anything in the combined object that summarizes the performance of the forests will be nonsense.
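For example, dropping one row from the larger chunk before fitting makes the dimensions conform (a sketch based on the row counts from the comments; remember the error summaries of the combined object will still be gibberish):

d1 <- d1[-nrow(d1), ]  # 38735 -> 38734 rows, matching d2
rf1 <- randomForest(A55 ~ ., data = d1, ntree = 10)
rf2 <- randomForest(A55 ~ ., data = d2, ntree = 10)
rf <- combine(rf1, rf2)  # now runs; use it only for predict()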

However, I think the intended way to use combine is to fit multiple random forests on the full data set, but with a reduced number of trees, and then to combine those forests.
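For instance (a minimal sketch using the built-in iris data):

library(randomForest)
data(iris)
# two small forests grown on the *same* full data set...
rf1 <- randomForest(Species ~ ., iris, ntree = 25, norm.votes = FALSE)
rf2 <- randomForest(Species ~ ., iris, ntree = 25, norm.votes = FALSE)
# ...combined into a single 50-tree forest; the votes matrices line up
# row for row, so the stock combine() works without complaint
rf.all <- combine(rf1, rf2)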

Edit

I went ahead and modified combine to "handle" unequal training set sizes. All that really means is that I removed the large chunk of code that was trying to stitch together things that weren't going to match up. But I kept the portion that combines the forests themselves, so you can still use predict:

library(randomForest)  # my_combine() calls importance() from this package

my_combine <- function (...) 
{
    pad0 <- function(x, len) c(x, rep(0, len - length(x)))
    padm0 <- function(x, len) rbind(x, matrix(0, nrow = len - 
        nrow(x), ncol = ncol(x)))
    rflist <- list(...)
    areForest <- sapply(rflist, function(x) inherits(x, "randomForest"))
    if (any(!areForest)) 
        stop("Argument must be a list of randomForest objects")
    rf <- rflist[[1]]
    classRF <- rf$type == "classification"
    trees <- sapply(rflist, function(x) x$ntree)
    ntree <- sum(trees)
    rf$ntree <- ntree
    nforest <- length(rflist)
    haveTest <- !any(sapply(rflist, function(x) is.null(x$test)))
    vlist <- lapply(rflist, function(x) rownames(importance(x)))
    numvars <- sapply(vlist, length)
    if (!all(numvars[1] == numvars[-1])) 
        stop("Unequal number of predictor variables in the randomForest objects.")
    for (i in seq_along(vlist)) {
        if (!all(vlist[[i]] == vlist[[1]])) 
            stop("Predictor variables are different in the randomForest objects.")
    }
    haveForest <- sapply(rflist, function(x) !is.null(x$forest))
    if (all(haveForest)) {
        nrnodes <- max(sapply(rflist, function(x) x$forest$nrnodes))
        rf$forest$nrnodes <- nrnodes
        rf$forest$ndbigtree <- unlist(sapply(rflist, function(x) x$forest$ndbigtree))
        rf$forest$nodestatus <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$nodestatus, nrnodes)))
        rf$forest$bestvar <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$bestvar, nrnodes)))
        rf$forest$xbestsplit <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$xbestsplit, nrnodes)))
        rf$forest$nodepred <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$nodepred, nrnodes)))
        tree.dim <- dim(rf$forest$treemap)  # retained from the original combine; not used below
        if (classRF) {
            rf$forest$treemap <- array(unlist(lapply(rflist, 
                function(x) apply(x$forest$treemap, 2:3, pad0, 
                  nrnodes))), c(nrnodes, 2, ntree))
        }
        else {
            rf$forest$leftDaughter <- do.call("cbind", lapply(rflist, 
                function(x) padm0(x$forest$leftDaughter, nrnodes)))
            rf$forest$rightDaughter <- do.call("cbind", lapply(rflist, 
                function(x) padm0(x$forest$rightDaughter, nrnodes)))
        }
        rf$forest$ntree <- ntree
        if (classRF) 
            rf$forest$cutoff <- rflist[[1]]$forest$cutoff
    }
    else {
        rf$forest <- NULL
    }
    #
    #Tons of stuff removed here...
    #
    if (classRF) {
        rf$confusion <- NULL
        rf$err.rate <- NULL
        if (haveTest) {
            rf$test$confusion <- NULL
            rf$test$err.rate <- NULL
        }
    }
    else {
        rf$mse <- rf$rsq <- NULL
        if (haveTest) 
            rf$test$mse <- rf$test$rsq <- NULL
    }
    rf
}

And then you can test it like this:

data(iris)
d <- iris[sample(150,150),]
d1 <- d[1:70,]
d2 <- d[71:150,]
rf1 <- randomForest(Species ~ ., d1, ntree=50, norm.votes=FALSE)
rf2 <- randomForest(Species ~ ., d2, ntree=50, norm.votes=FALSE)

rf.all <- my_combine(rf1,rf2)
predict(rf.all,newdata = iris)

Obviously, this comes with absolutely no warranty! :)

Derringer answered 4/10, 2013 at 2:5 Comment(5)
Thanks for the detailed response @joran. I wonder whether, if the votes matrices were the same size, combining and predicting would make any sense, since, as you say, each row would be a combination of two different training cases. Thanks for the link as well; I will check that out. Sadly, in my case making each training set the same size isn't an option, because one thing I'd like to see is how skewed data affects the overall performance of this system. On to bigger and (hopefully) better ideas!Tapes
Thanks again, that's really great. What I ended up doing was creating a bunch of forests and making a prediction with each one separately; then I combined those matrices into one larger prediction matrix. Thanks again though!Tapes
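For anyone taking the same route, that per-forest prediction approach might look roughly like this (a hypothetical sketch, not the poster's actual code; rf_list and test_data stand in for the separately trained forests and the evaluation data):

# predict class probabilities with each forest separately...
probs <- lapply(rf_list, predict, newdata = test_data, type = "prob")
# ...then average the prediction matrices and take the majority class
avg <- Reduce(`+`, probs) / length(probs)
pred <- colnames(avg)[max.col(avg)]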
@Derringer: Can we combine two forests, each fit on a different set of predictors? I have 5 predictors. When I use 2 of the predictors to predict classes, the fitted model has about 70% accuracy; when I use the other 3, I get about 65% accuracy. But when I use all 5, the accuracy stays at 65%.Shunt
Really great solution @joran. Can this be added to the package? It would be really helpful.Genesis
I am a little confused. Are you saying that combining the RF models made from multiple subsets of the same data (that have some overlapping rows) produces an issue in the error summaries? Because I would imagine that combining RFs for different "disjoint" datasets would be okay (i.e. group1 = rows 1:100000 and group2 = rows 100001:200000), since you would just append the vote matrix from the first to the second. Am I understanding this correctly? But then why would removing a single row allow combine to work?Interrupted
