Does the following code represent the preferred procedure for traversing the rows of an R data.table
and passing the values found at each row to a function? Or is there a more performant way to do this?
library(data.table)
set.seed(2)
n <- 100
b <- c(0.5, 1.5, -1)
phi <- 0.8
X <- cbind(1, matrix(rnorm(n*2, 0, 1), ncol = 2))
y <- X %*% matrix(b, ncol = 1) + rnorm(n, 0, phi)
d <- data.table(y, X)
setnames(d, c("y", "x0", "x1", "x2"))
logpost <- function(d, b1, b2, b3, phi, mub = 1, taub = 10, a = 0.5, z = 0.7){
N <- nrow(d)
mu <- b1 + b2 * d$x1 + b3 * d$x2
lp <- -N * log(phi) -
(1/(2*phi^2)) * sum( (d$y-mu)^2 ) -
(1/(2*taub^2))*( (b1-mub)^2 + (b2-mub)^2 + (b3-mub)^2 ) -
(a+1)*log(phi) - (z/phi)
lp
}
nn <- 21
grid <- data.table(
expand.grid(b1 = seq(0, 1, len = nn),
b2 = seq(1, 2, len = nn),
b3 = seq(-1.5, -0.5, len = nn),
phi = seq(0.4, 1.2, len = nn)))
grid[, id := 1:.N]
setkey(grid, id)
wraplogpost <- function(dd){
logpost(d, dd$b1, dd$b2, dd$b3, dd$phi)
}
start <- Sys.time()
grid[, lp := wraplogpost(.SD), by = seq_len(nrow(grid))]
difftime(Sys.time(), start)
# Time difference of 2.081544 secs
Edit: display first few records
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
I have tried using set
but that approach seems inferior
start <- Sys.time()
grid[, lp := NA_real_]
for(i in 1:nrow(grid)){
llpp <- wraplogpost(grid[i])
set(grid, i, "lp", llpp)
}
difftime(Sys.time(), start)
# Time difference of 21.71291 secs
Edit: display first few records
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
Suggestions or pointers to the relevant docs would be appreciated.
Edit: per comments:
start <- Sys.time()
grid[, lp := wraplogpost(.SD), by = .I]
difftime(Sys.time(), start)
Warning messages:
1: In b2 * d$x1 :
longer object length is not a multiple of shorter object length
2: In b3 * d$x2 :
longer object length is not a multiple of shorter object length
3: In d$y - mu :
longer object length is not a multiple of shorter object length
> difftime(Sys.time(), start)
Time difference of 0.01199317 secs
>
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -620977.2
2: 0.05 1 -1.5 0.4 2 -620977.2
3: 0.10 1 -1.5 0.4 3 -620977.2
4: 0.15 1 -1.5 0.4 4 -620977.2
5: 0.20 1 -1.5 0.4 5 -620977.2
6: 0.25 1 -1.5 0.4 6 -620977.2
which generates the wrong values for lp
.
Edit thank you for the comments and responses. I am aware that this scenario could be addressed by using alternative methods, my interest is in what the preferred way to do this is when using data.table
.
Edit thank you for the responses again. As there have been none that address the question of how to do this explicitly with data.table
, at the moment, I am assuming that there is no ideal way to achieve this without turning to base R.
by = .I
. It's faster, see?.I
. – Ululate.I
should be used for obtaining the row indices inj
rather than as aby
term. Also the answer here: https://mcmap.net/q/676576/-row-operations-in-data-table-using-by-i, suggests (to me at least) that.I
should not be used in the by clause. Am I interpreting that answer incorrectly? – Casebook.I
returnsseq_len(nrow(grid))
but is faster since it's a value computed by data.table. Try it. – Ululate$
extractions. Your loop would work better if your data was a matrix as opposed to a list (i.e,data.table
). – Bushelman