I'm trying to train several random forests (for regression) to have them compete and see which feature selection and which parameters give the best model.
However, the training runs seem to take an insanely long time, and I'm wondering if I'm doing something wrong.
The dataset I'm using for training (called train below) has 217k rows and 58 columns, of which only 21 serve as predictors in the random forest. The predictors are all numeric or integer, with the exception of a boolean one, which is of class character. The y output is numeric.
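In case it matters, here is how I coerce that character column to a factor before training, so randomForest treats it as a categorical predictor (a minimal sketch; bool_col is a placeholder, not the real column name):

train$bool_col <- as.factor(train$bool_col)  # bool_col: placeholder name for the character "boolean" column
str(train$bool_col)                          # should now show a factor with 2 levels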
I ran the following code four times, giving the values 4, 100, 500, and 2000 to nb_trees:
library("randomForest")
nb_trees <- #this changes with each test, see above
ptm <- proc.time()
fit <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
+ x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19
+ x20 + x21,
data = train,
ntree = nb_trees,
do.trace=TRUE)
proc.time() - ptm
Here is how long each of them took to train (the time grows roughly linearly with the number of trees, at about one minute per tree):
nb_trees | time
---------+------------
       4 | 4 min
     100 | 1 h 41 min
     500 | 8 h 40 min
    2000 | 34 h 26 min
As my company's server has 12 cores and 125 GB of RAM, I figured I could try to parallelize the training, following this answer. (However, I used the doParallel package because the doSNOW version seemed to run forever, and I don't know why. I also can't find where I saw that doParallel would work too, sorry.)
library("randomForest")
library("foreach")
library("doParallel")
nb_trees <- #this changes with each test, see table below
nb_cores <- #this changes with each test, see table below
cl <- makeCluster(nb_cores)
registerDoParallel(cl)
ptm <- proc.time()
fit <- foreach(ntree = rep(nb_trees, nb_cores), .combine = combine, .packages = "randomForest")
%dopar% {
randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
+ x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19
+ x20 + x21,
data = train,
ntree = ntree,
do.trace=TRUE)}
proc.time() - ptm
stopCluster(cl)
When I run it, it takes less time than the non-parallelized code:
nb_trees | nb_cores | total number of trees  | time
---------+----------+------------------------+-----------------------------
       1 |        4 |    4                   | 2 min 13 s
      10 |       10 |  100                   | 52 min
       9 |       12 |  108 (closest to 100)  | 59 min
      42 |       12 |  504 (closest to 500)  | I won't be running this one
     167 |       12 | 2004 (closest to 2000) | I'll run it next weekend
However, I think it's still taking a long time, isn't it? I'm aware it takes time to combine the trees into the final forest, so I didn't expect a 12x speedup with 12 cores, but it's only ~2 times faster... (I sketch one idea for cutting that combining overhead after my questions below.)
- Is this normal?
- If it isn't, is there anything I can do with my data and/or my code to radically decrease the running time?
- If not, should I tell the guy in charge of the server that it should be much faster?
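For what it's worth, here is the variant I plan to try to cut the combining overhead (an untested sketch on my side; as far as I understand, randomForest's combine() accepts more than two forests at once, so .multicombine = TRUE should make foreach call it fewer times):

fit <- foreach(ntree = rep(nb_trees, nb_cores),
               .combine = randomForest::combine,
               .multicombine = TRUE,  # pass several forests to combine() per call
               .packages = "randomForest") %dopar% {
    randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
                 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19
                 + x20 + x21,
                 data = train,
                 ntree = ntree)
}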
Thanks for your answers.
Notes:
- I'm the only one using this server.
- For my next tests, I'll get rid of the columns that are not used in the random forest.
- I realized quite late that I could improve the running time by calling randomForest(predictors, decision) instead of randomForest(decision ~ ., data = input), and I'll be doing it from now on (see the sketch after these notes), but I think my questions above still hold.
- I train with do.trace = TRUE, so that I can see how the error evolves as a function of the number of trees. Is there a similar parameter to also see how the top predictors evolve? (So that I can run the training only once, with 512 trees.)
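For reference, here is roughly what that x/y call will look like (a sketch; it assumes the 21 predictors are literally named x1 to x21, and it drops the 37 unused columns at the same time):

predictor_cols <- paste0("x", 1:21)               # keep only the 21 predictor columns
fit <- randomForest(x = train[, predictor_cols],  # predictors as a data frame
                    y = train$y,                  # numeric response => regression
                    ntree = nb_trees,
                    do.trace = TRUE)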