Implementing Longitudinal Random Forest with LongituRF package in R

I have some high-dimensional repeated-measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:

Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.

Conveniently, the authors provide some useful data-generating functions for testing. So we have

install.packages("LongituRF")
library(LongituRF)

Let's generate some data with DataLongGenerator() which takes as arguments n=sample size, p=number of predictors and G=number of predictors with temporal behavior.

my_data <- DataLongGenerator(n=50,p=6,G=6)

my_data is a list of what you'd expect: Y (response vector), X (matrix of fixed-effects predictors), Z (matrix of random-effects predictors), id (vector of sample identifiers) and time (vector of time measurements). To fit a random forest model, simply call

model <- REEMforest(X=my_data$X,Y=my_data$Y,Z=my_data$Z,time=my_data$time,
                    id=my_data$id,sto="BM",mtry=2)

This takes about 50 seconds here, so bear with me.
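
Before moving to real data, it can help to confirm that the components of such a list line up row for row. The following is a minimal sketch in base R. `check_long_inputs` is a hypothetical helper name, not part of LongituRF, and the toy list only mimics the component names described above with made-up values.

```r
# Hypothetical helper (not part of LongituRF): checks that the pieces of a
# long-format data list share the same number of observations.
check_long_inputs <- function(d) {
  n_obs <- length(d$Y)
  stopifnot(
    is.matrix(d$X), nrow(d$X) == n_obs,
    is.matrix(d$Z), nrow(d$Z) == n_obs,
    length(d$id) == n_obs,
    length(d$time) == n_obs
  )
  c(observations = n_obs,
    subjects     = length(unique(d$id)),
    predictors   = ncol(d$X),
    random_terms = ncol(d$Z))
}

# Toy list with the same component names as the generated data (made-up values)
set.seed(1)
toy <- list(
  Y    = rnorm(6),
  X    = matrix(rnorm(12), nrow = 6),
  Z    = cbind(1, runif(6)),
  id   = c(1, 1, 1, 2, 2, 2),
  time = c(1, 2, 3, 1, 2, 3)
)
dims <- check_long_inputs(toy)
print(dims)
```

The same helper could be applied to my_data, or to a list assembled from real data, before calling REEMforest().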

So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?

Looking at my_data$Z.

dim(my_data$Z)
[1] 471   2
head(my_data$Z)
      [,1]      [,2]
 [1,]    1 1.1128914
 [2,]    1 1.0349287
 [3,]    1 0.7308948
 [4,]    1 1.0976203
 [5,]    1 1.3739856
 [6,]    1 0.6840415

Each row looks like an intercept term (i.e. 1) followed by a value drawn from a uniform distribution with runif().

The documentation of REEMforest() indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?

My understanding is that traditionally Z is simply a one-hot (binary) encoding of the group variables (e.g. as described here), so Z from DataLongGenerator() should be an nxG (471x6) sparse matrix, no?
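
To make the two conventions concrete, here is a small sketch in base R on made-up data. It contrasts the one-hot subject indicator matrix from the stacked mixed-model notation with a per-observation N x q matrix of random-effect predictors, where the grouping is carried separately by id. That second reading of the REEMforest() documentation is an assumption on my part, not something the package authors have confirmed here.

```r
# Made-up long-format data: 3 subjects, 2 to 3 visits each
id   <- c(1, 1, 1, 2, 2, 3, 3, 3)
time <- c(0, 1, 2, 0, 1, 0, 1, 2)
N    <- length(id)

# (a) Stacked notation: one indicator column per subject (N x n_subjects)
Z_onehot <- model.matrix(~ 0 + factor(id))

# (b) Per-observation notation: q random-effect predictors per row (N x q)
Z_intercept <- matrix(1, nrow = N, ncol = 1)   # random intercept only, q = 1
Z_slope     <- cbind(1, time)                  # intercept + time slope, q = 2

dim(Z_onehot)     # 8 3
dim(Z_intercept)  # 8 1
dim(Z_slope)      # 8 2
```

Under this reading, the 471x2 matrix returned by DataLongGenerator() would correspond to case (b) with q = 2: an intercept column plus one further covariate.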

Clarity on how to specify the Z parameter with actual data would be appreciated.

EDIT

My specific example is as follows. I have a response variable (Y). Samples (identified with id) were randomly assigned to an intervention (I, intervention or no intervention). There is a high-dimensional set of features (X). Features and response were measured at two timepoints (Time, baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important for predicting Y (the same way Capitaine et al. did with HIV in their paper).

I will call REEMforest() as follows

REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)

What should I use for Z?
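
For illustration, one way the inputs for this design might be assembled is sketched below in base R. Everything here rests on assumptions: the data are made up, the intervention and timepoints are coded as 0/1, and Z is taken to be a single column of ones (a random intercept per sample). Whether that is the right choice of Z is exactly the open question, so the REEMforest() call is only shown as a comment.

```r
# Made-up stand-in for the real data: 4 samples, 2 timepoints, 3 features
set.seed(42)
id   <- rep(1:4, each = 2)
Time <- rep(c(0, 1), times = 4)             # 0 = baseline, 1 = endpoint
I    <- rep(c(0, 1, 0, 1), each = 2)        # 0 = no intervention, 1 = intervention
X    <- matrix(rnorm(8 * 3), nrow = 8,
               dimnames = list(NULL, paste0("feat", 1:3)))
Y    <- rnorm(8)

# Fixed-effect predictors: features plus the intervention indicator
X_full <- cbind(X, I = I)

# Assumed random-effects design: a single random intercept per sample
Z <- matrix(1, nrow = length(Y), ncol = 1)

# The call would then take this form (not run here):
# REEMforest(X = X_full, Y = Y, Z = Z, id = id, time = Time, sto = "BM", mtry = 2)
```

With only two timepoints per sample, a richer Z such as cbind(1, Time) is unlikely to be well identified, which is why the sketch stops at a random intercept.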

library(dplyr)   # %>%, mutate, arrange, left_join, row_number
library(tidyr)   # separate

data2 <- data %>%
  mutate(oOrder = row_number()) %>%   # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number())       # because the random effects will be in order by time then id

extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time, into = c("time", "id"), sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time)                 # set data type to match dataset for time

data2 <- data2 %>%
  left_join(extRE) %>%
  arrange(oOrder)                     # return to original order

Z = cbind(rep(1, times = nrow(data2)), data2$Z)
