Hyper-parameter tuning using pure ranger package in R

Love the speed of the ranger package for random forest model creation, but can't see how to tune mtry or number of trees. I realize I can do this via caret's train() syntax, but I prefer the speed increase that comes from using pure ranger.

Here's my example of basic model creation using ranger (which works great):

library(ranger)
data(iris)

fit.rf = ranger(
  Species ~ .,
  data = iris,
  num.trees = 200
)

print(fit.rf)

Looking at the official documentation for tuning options, it seems like the csrf() function may provide the ability to tune hyper-parameters, but I can't get the syntax right:

library(ranger)
data(iris)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  params1 = list(num.trees = 25, mtry=4),
  params2 = list(num.trees = 50, mtry=4)
)

print(fit.rf.tune)

Results in:

Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : 
  unused argument (training_data = iris)

And I'd prefer to tune with the regular (read: non-csrf) rf algorithm ranger provides. Any idea as to a hyper-parameter tuning solution for either path in ranger? Thank you!

Tonometer answered 29/5, 2016 at 20:24 Comment(0)

I think there are at least two errors:

First, the function ranger does not have a parameter called training_data; the argument is simply called data. That is what your error message, unused argument (training_data = iris), refers to. You can see this by looking at ?ranger or args(ranger).

Second, the function csrf, on the other hand, does take training_data as input, but it also requires test_data. Importantly, neither argument has a default, so you must provide both. The following works without problems:

fit.rf = ranger(
  Species ~ ., data = iris,
  num.trees = 200
)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  test_data = iris,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)

Here, I have simply passed iris as both the training and the test dataset; you would obviously not want to do that in a real application. Moreover, note that ranger itself also takes num.trees and mtry as input, so you could tune the regular random forest there.
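
For instance, a minimal sketch of tuning mtry with plain ranger, comparing the out-of-bag error stored in each fitted object's prediction.error field:

library(ranger)
data(iris)

# fit one forest per candidate mtry and record its OOB error
oob_error <- sapply(1:4, function(m) {
  fit <- ranger(Species ~ ., data = iris, num.trees = 200, mtry = m)
  fit$prediction.error
})

# mtry with the lowest OOB error
which.min(oob_error)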

Slush answered 29/5, 2016 at 21:5 Comment(4)
Fantastic info, thanks! To your knowledge, there's no non-csrf route to hyper-parameter tuning in ranger? Also, Zheyuan, I did originally ask if a non-csrf option was available (and not just for a fix for the documented csrf implementation).Tonometer
Very generous, guys, thanks. Just a note, coffeinjunky--even though the error message I posted said I had used the ranger function, I had actually used the csrf function (not sure if you want to edit your response). I'll email Marvin Wright (the maintainer) an FYI about this. Thanks, again!Tonometer
Also, coffeinjunky, if you're editing, would you mind adding an example of param1, param2 syntax for tuning with ranger function? Thanks!Tonometer
Just add num.trees=5 or any other number, or mtry=5 or any other number, to your call. As in ranger(Species ~ ., data = iris, num.trees = 200, mtry=5)Slush

To answer my (unclear) question: apparently ranger has no built-in CV/grid-search functionality. However, here's how to do hyper-parameter tuning with ranger (via a grid search) outside of caret. Thanks to Marvin Wright (the maintainer of ranger) for the code. It turns out caret CV with ranger was slow for me because I was using the formula interface (which should be avoided).

ptm <- proc.time()
library(ranger)
library(mlr)

# Define task and learner
task <- makeClassifTask(id = "iris",
                        data = iris,
                        target = "Species")

learner <- makeLearner("classif.ranger")

# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", lower = 3, upper = 4),
                   makeDiscreteParam("num.trees", values = 200))

# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
           control = makeTuneControlGrid())

# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, task)

print(m)
print(proc.time() - ptm) # ~6 seconds
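
If you also want several num.trees values in the search (the comment below asks about this), the parameter set above can be extended; a sketch, reusing the task, learner, and rdesc objects defined above:

# grid over both mtry and several candidate num.trees values
ps <- makeParamSet(makeIntegerParam("mtry", lower = 3, upper = 4),
                   makeDiscreteParam("num.trees", values = c(100, 200, 500)))

res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())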

For the curious, the caret equivalent is

ptm <- proc.time()
library(caret)
data(iris)

grid <-  expand.grid(mtry = c(3,4))

fitControl <- trainControl(method = "cv",
                           number = 5,
                           verboseIter = TRUE)

fit = train(
  x = iris[ , names(iris) != 'Species'],
  y = iris[ , names(iris) == 'Species'],
  method = 'ranger',
  num.trees = 200,
  tuneGrid = grid,
  trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds

Overall, caret is the fastest way to do a grid search with ranger if one uses the non-formula interface.
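
As an aside, the non-formula interface in pure ranger means passing the response by name rather than as a formula; a minimal sketch:

library(ranger)
data(iris)

# avoid the formula interface by naming the response column
fit <- ranger(dependent.variable.name = "Species",
              data = iris,
              num.trees = 200)
print(fit)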

Tonometer answered 15/6, 2016 at 21:14 Comment(1)
Thanks for providing these solutions. Quick question, is it possible to include a list of num.tree hyperparameters within the search grid?Cottar

Another way to tune the model is to build the grid manually. There may be better ways to train the model, but this offers a different option.

hyper_grid <- expand.grid(
  mtry      = 1:4,
  node_size = 1:3,
  num.trees = seq(50, 500, 50),
  OOB_RMSE  = 0
)

system.time(
  for(i in 1:nrow(hyper_grid)) {
    # train model
    rf <- ranger(
      formula       = Species ~ .,
      data          = iris,
      num.trees     = hyper_grid$num.trees[i],
      mtry          = hyper_grid$mtry[i],
      min.node.size = hyper_grid$node_size[i],
      importance    = 'impurity')
    # add OOB error to grid; note that for classification,
    # ranger's prediction.error is the OOB misclassification
    # rate, so "RMSE" is a loose label here
    hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)
  })
user  system elapsed 
3.17    0.19    1.36

nrow(hyper_grid) # 120 models
position = which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE),],5)
     mtry node_size num.trees     OOB_RMSE
6     2         2        50 0.1825741858
23    3         3       100 0.1825741858
3     3         1        50 0.2000000000
11    3         3        50 0.2000000000
14    2         1       100 0.2000000000

# fit best model
rf.model <- ranger(
  Species ~ .,
  data = iris,
  num.trees = hyper_grid$num.trees[position],
  mtry = hyper_grid$mtry[position],
  min.node.size = hyper_grid$node_size[position],
  importance = 'impurity',
  probability = FALSE
)
rf.model
Ranger result

Call:
 ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position]) 

Type:                             Classification 
Number of trees:                  50 
Sample size:                      150 
Number of independent variables:  4 
Mtry:                             2 
Target node size:                 2 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error:             5.33 % 

I hope this helps.

Pangermanism answered 24/9, 2018 at 14:48 Comment(0)

Note that mlr by default disables the internal parallelization of ranger. Set the hyperparameter num.threads to the number of available cores to speed mlr up:

learner <- makeLearner("classif.ranger", num.threads = 4)

Alternatively, start a parallel backend via

library(parallelMap)

parallelStartMulticore(4) # linux/osx
parallelStartSocket(4)    # windows

before calling tuneParams to parallelize the tuning.
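
Putting it together, a sketch of the full sequence; parallelStartSocket and parallelStop come from the parallelMap package, and task, learner, rdesc, and ps are assumed to be defined as in the mlr answer above:

library(parallelMap)

parallelStartSocket(4)   # or parallelStartMulticore(4) on linux/osx
res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())
parallelStop()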

Pelagian answered 31/1, 2018 at 11:12 Comment(0)

There is also the tuneRanger R package, which is specifically designed for tuning ranger: it ships with sensible predefined hyperparameter spaces and tunes intelligently by evaluating candidates on the out-of-bag observations.
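
A minimal sketch of its use on iris (tuneRanger expects an mlr task; the num.trees value here is just illustrative):

library(tuneRanger)
library(mlr)

# tuneRanger tunes mtry, min.node.size and sample.fraction
# on the out-of-bag observations
task <- makeClassifTask(data = iris, target = "Species")
res <- tuneRanger(task, num.trees = 200)
res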

Note that random forest is not an algorithm where tuning usually makes a big difference, but it can typically improve performance a bit.

Manifold answered 3/8, 2020 at 7:22 Comment(0)
