I have some high-dimensional repeated-measures data, and I am interested in fitting random forest models to investigate their suitability and predictive utility. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Conveniently, the authors provide some useful data-generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator(), which takes as arguments n = sample size, p = number of predictors, and G = number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data is a list of what you'd expect: Y (response vector), X (matrix of fixed-effects predictors), Z (matrix of random-effects predictors), id (vector of sample identifiers) and time (vector of time measurements). To fit a random forest model, simply run
model <- REEMforest(X=my_data$X,Y=my_data$Y,Z=my_data$Z,time=my_data$time,
id=my_data$id,sto="BM",mtry=2)
This takes about 50 seconds here, so bear with me.
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z:
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row looks like an intercept term (i.e. 1) followed by a value drawn from a uniform distribution, as with runif().
The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
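To make the question concrete, here is a minimal base-R sketch of what I would guess an N x q matrix could look like, assuming Z plays the same role as the random-effects design matrix in an ordinary linear mixed model. That reading is my assumption, not something the REEMforest() documentation confirms, and the toy id and time values are made up for illustration.

```r
# Toy long-format data: 3 subjects with unequal numbers of visits (made-up values)
id   <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
time <- c(0, 1, 2, 0, 1, 0, 1, 2, 3)
N    <- length(id)  # total number of rows (observations)

# q = 1: random intercept only, i.e. a single column of ones
Z_intercept <- matrix(1, nrow = N, ncol = 1)

# q = 2: random intercept plus a random slope on time
Z_slope <- cbind(1, time)

dim(Z_intercept)
dim(Z_slope)
```

Under this reading, the two-column my_data$Z would correspond to the q = 2 case, with some covariate other than time in the second column.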
My understanding is that traditionally Z is simply a one-hot (binary) encoding of the group variables (e.g. as described here), so Z from DataLongGenerator() should be an n x G (471 x 6) sparse matrix, no?
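For comparison, this is roughly what the one-hot encoding I have in mind would look like in base R, using model.matrix() on a made-up id vector. It is only a sketch of the encoding itself, not of anything LongituRF does internally. Note that the number of columns it produces equals the number of distinct id values, not the number of predictors.

```r
# Made-up grouping variable: 3 subjects, 9 observations in total
id <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3))

# One indicator column per subject; "0 +" drops the global intercept
Z_onehot <- model.matrix(~ 0 + id)

dim(Z_onehot)      # rows = observations, columns = distinct ids
rowSums(Z_onehot)  # each observation belongs to exactly one subject
```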
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows. I have a response variable (Y). Samples (identified with id) were randomly assigned to an intervention (I, intervention or no intervention). There is a high-dimensional set of features (X). Features and response were measured at two timepoints (Time, baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important for predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows:
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
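In case it helps, here is a small simulated stand-in for the layout of my data, with one candidate Z that I could imagine using: a single column of ones, i.e. a random intercept per sample. The sizes, the random values and the choice of Z are all placeholders of mine, and whether this Z is appropriate is exactly what I am asking. The REEMforest() call is left as a comment because it needs LongituRF installed; its sto and mtry values are simply copied from the example above.

```r
set.seed(1)
n_subj <- 4  # placeholder number of samples
p      <- 5  # placeholder number of features

id   <- rep(seq_len(n_subj), each = 2)   # two rows per sample
Time <- rep(c(0, 1), times = n_subj)     # 0 = baseline, 1 = endpoint
I    <- rep(c(0, 1, 0, 1), each = 2)     # intervention is constant within a sample
X    <- matrix(rnorm(length(id) * p), ncol = p)
Y    <- rnorm(length(id))

XI <- cbind(X, I)                             # predictors passed as X
Z  <- matrix(1, nrow = length(id), ncol = 1)  # candidate Z: random intercept only

# All inputs must describe the same rows
stopifnot(nrow(XI) == length(Y), nrow(Z) == length(Y),
          length(id) == length(Y), length(Time) == length(Y))

# Requires LongituRF; not run here:
# REEMforest(X = XI, Y = Y, Z = Z, id = id, time = Time, sto = "BM", mtry = 2)
```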