Ah. This is either an oversight in `combine` or what you're trying to do is nonsensical, depending on your point of view.
The votes matrix records the number of votes in the forest for each case in the training data for each response category. Naturally, it will have the same number of rows as the number of rows in your training data.
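To make the shape of the votes matrix concrete, here is a minimal sketch on `iris` (the data set, the seed, and `ntree = 50` are illustrative choices of mine, not taken from the question). It only shows that the matrix has one row per training case and one column per response class, so `length()` of it is the product of the two.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only for reproducibility
rf <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)

# One row per training case, one column per response class
dim(rf$votes)     # 150 3
# length() of a matrix is rows * columns
length(rf$votes)  # 450
```

So a large `length(rf$votes)` is nothing unusual: it is the number of training rows multiplied by the number of classes.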
`combine` is assuming that you ran your random forests twice on the same set of data, so the dimensions of those matrices will be the same. It's doing this because it wants to provide you with some "overall" error estimates for the combined forest.
But if the two data sets are different, combining the votes matrices becomes simply nonsensical. You could get `combine` to run by simply removing one row from your larger training data set, but the resulting votes matrix in the combined forest would be gibberish, since each row would be a combination of votes for two different training cases.
So maybe this is simply something that should be an option that can be turned off in `combine`, because it should still make sense to combine the actual trees and `predict` on the resulting object. But some of the "combined" error estimates in the output from `combine` will be meaningless.
Long story short, make each training data set the same size, and it will run. But if you do, I wouldn't use the resulting object for anything other than making new predictions. Anything that is combined that was summarizing the performance of the forests will be nonsense.
However, I think the intended way to use `combine` is to fit multiple random forests on the full data set, but with a reduced number of trees, and then to combine those forests.
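A rough sketch of that usage, assuming the `iris` data and 50 trees per forest purely for illustration. Every forest is fit on the same full data set, so the votes matrices all have the same dimensions and `combine` can add them up.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility

# Three small forests, each grown on the *same* full data set.
# norm.votes = FALSE keeps raw vote counts so they can be summed.
rf1 <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)
rf3 <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)

# Called with the namespace prefix because other packages also export
# a function named combine().
rf.all <- randomForest::combine(rf1, rf2, rf3)

rf.all$ntree                         # total number of trees across the forests
head(predict(rf.all, newdata = iris))
```

Used this way, the combined object behaves like a single larger forest for prediction.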
Edit
I went ahead and modified `combine` to "handle" unequal training set sizes. All that means really is that I removed a large chunk of code that was trying to stitch things together that weren't going to match up. But I kept the portion that combines the forests, so you can still use `predict`:
    my_combine <- function (...)
    {
        # Helpers that zero-pad vectors/matrices up to a common node count
        pad0 <- function(x, len) c(x, rep(0, len - length(x)))
        padm0 <- function(x, len) rbind(x, matrix(0, nrow = len -
            nrow(x), ncol = ncol(x)))
        rflist <- list(...)
        areForest <- sapply(rflist, function(x) inherits(x, "randomForest"))
        if (any(!areForest))
            stop("Argument must be a list of randomForest objects")
        # The first forest is used as the template for the result
        rf <- rflist[[1]]
        classRF <- rf$type == "classification"
        trees <- sapply(rflist, function(x) x$ntree)
        ntree <- sum(trees)
        rf$ntree <- ntree
        nforest <- length(rflist)
        haveTest <- !any(sapply(rflist, function(x) is.null(x$test)))
        # All forests must have been fit on the same predictors
        vlist <- lapply(rflist, function(x) rownames(importance(x)))
        numvars <- sapply(vlist, length)
        if (!all(numvars[1] == numvars[-1]))
            stop("Unequal number of predictor variables in the randomForest objects.")
        for (i in seq_along(vlist)) {
            if (!all(vlist[[i]] == vlist[[1]]))
                stop("Predictor variables are different in the randomForest objects.")
        }
        # Stitch the tree structures together, padded to the largest tree
        haveForest <- sapply(rflist, function(x) !is.null(x$forest))
        if (all(haveForest)) {
            nrnodes <- max(sapply(rflist, function(x) x$forest$nrnodes))
            rf$forest$nrnodes <- nrnodes
            rf$forest$ndbigtree <- unlist(sapply(rflist, function(x) x$forest$ndbigtree))
            rf$forest$nodestatus <- do.call("cbind", lapply(rflist,
                function(x) padm0(x$forest$nodestatus, nrnodes)))
            rf$forest$bestvar <- do.call("cbind", lapply(rflist,
                function(x) padm0(x$forest$bestvar, nrnodes)))
            rf$forest$xbestsplit <- do.call("cbind", lapply(rflist,
                function(x) padm0(x$forest$xbestsplit, nrnodes)))
            rf$forest$nodepred <- do.call("cbind", lapply(rflist,
                function(x) padm0(x$forest$nodepred, nrnodes)))
            tree.dim <- dim(rf$forest$treemap)
            if (classRF) {
                rf$forest$treemap <- array(unlist(lapply(rflist,
                    function(x) apply(x$forest$treemap, 2:3, pad0,
                        nrnodes))), c(nrnodes, 2, ntree))
            }
            else {
                rf$forest$leftDaughter <- do.call("cbind", lapply(rflist,
                    function(x) padm0(x$forest$leftDaughter, nrnodes)))
                rf$forest$rightDaughter <- do.call("cbind", lapply(rflist,
                    function(x) padm0(x$forest$rightDaughter, nrnodes)))
            }
            rf$forest$ntree <- ntree
            if (classRF)
                rf$forest$cutoff <- rflist[[1]]$forest$cutoff
        }
        else {
            rf$forest <- NULL
        }
        #
        #Tons of stuff removed here...
        #
        # Drop performance summaries; they are not meaningful for the combined forest
        if (classRF) {
            rf$confusion <- NULL
            rf$err.rate <- NULL
            if (haveTest) {
                rf$test$confusion <- NULL
                # was rf$err.rate, which had already been cleared above
                rf$test$err.rate <- NULL
            }
        }
        else {
            rf$mse <- rf$rsq <- NULL
            if (haveTest)
                rf$test$mse <- rf$test$rsq <- NULL
        }
        rf
    }
And then you can test it like this:
    library(randomForest)

    data(iris)
    # Shuffle the rows, then split into two training sets of unequal size
    d <- iris[sample(150, 150), ]
    d1 <- d[1:70, ]
    d2 <- d[71:150, ]

    rf1 <- randomForest(Species ~ ., d1, ntree = 50, norm.votes = FALSE)
    rf2 <- randomForest(Species ~ ., d2, ntree = 50, norm.votes = FALSE)

    rf.all <- my_combine(rf1, rf2)
    predict(rf.all, newdata = iris)
Obviously, this comes with absolutely no warranty! :)
[...] `str` to see). Do they have exactly the same variables, all named the same way? – Derringer
From `str`, the first lines are `'data.frame': 38735 obs. of 55 variables:` for d1 and `'data.frame': 38734 obs. of 55 variables:` for d2, followed by the same names for each data set. – Tapes
`> length(rf1$votes) [1] 271145` and `> length(rf2$votes) [1] 271138`. First, what do you mean by factor levels? And second, I read in the documentation that the votes are 'a matrix with one row for each input data point and one column for each class, giving the fraction or number of (OOB) ‘votes’ from the random forest.' It makes sense this would be imbalanced since the data is imbalanced, but where do these large lengths come from? – Tapes