How to do row wise operations on .SD columns in data.table
Asked Answered
E

2

3

Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...

I want to do row wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, lapply, or if the belong inside the brackets of .SD.

As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?

Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)

library(data.table)

set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))

dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89
Ephialtes answered 26/10, 2015 at 18:28 Comment(10)
If you want to do rowwise operations in data.table, you can create an index variable with .I and use that in the by = part.Arlie
If you're concerned about efficiency, eddi's answer to my question here is quite good: https://mcmap.net/q/1176183/-efficiently-computing-a-linear-combination-of-data-table-columns (and, I think, relevant)Baskin
I' guessing that you could gain efficiency and clarity either by having your data as "data.frame" with columns like ["id" "quarter" "exam"] and use database operations or by having a "list" (depending on a convenient factor) of "matrices" and use more 'arithmetic' operations.Leucite
@Arlie Do you mean use .I like this: #16574495 ? (I didn't initially realize this example might be relevant, thanks Frank for showing me this on github)Ephialtes
Use by = 1:nrow(dt). Or depending on the operation, you might be able to use Reduce() in j.Skyway
Sorry, voting to close as "primarily opinion based". If it gets closed before you've found a satisfactory answer, you might consider posting a different question (or editing this one). There is also a package mailing list -- not the github project site itself, but linked from the wiki under "discussion": github.com/Rdatatable/data.table/wiki/Getting-startedBaskin
@frank How is this question opinion based?Ephialtes
It's opinion-based, in stackoverflow parlance, because there are many possible answers to "how would you do this differently?" Usually, it would be recommended that you post this sort of thing on codereview.stackexchange.com/tour but I mention the mailing list because I have no experience with the code review site.Baskin
@Baskin Ok, I edited the questionEphialtes
Ok, thanks. That works, I think.Baskin
B
4

Here are a few thoughts on your options, largely gathered from the comments:

apply along rows

The OP's approach uses apply(.,1,.) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply also are not suitable, since they are designed to work on each columns separately, not to combine them.

rowMeans and similarly-named functions also coerce to a matrix.

Split by rows

As @Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.

Efficiently create new columns

This approach taken from eddi is probably the most efficient if you must keep your data in wide format:

jwts = list( 
  q1_AVG  = c(q1_exam1 = 1  , q1_exam2 = 1  , q1_final =   1)/3, 
  q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)


for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}

This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).

Go long

As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like

# reshape
m = melt(dt, id.vars="id", value.name="score")[,
  c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]

# input your weighting rules
w = unique(m[,c("quarter","exam")])
w[quarter=="q1"                , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]

# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]

This is what I would do.


In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.

Baskin answered 27/10, 2015 at 17:0 Comment(2)
The logic in "Go long" makes a lot of sense (and will work better for my work). At the risk of sounding opinion based, I'd say this seems like the "right" way to think about it. In fact, this sounds like the "right" way to think through the problem.Ephialtes
Also, it's a good example to study to show the value of the new tstrsplit and on features.Ephialtes
E
1

In this case it is possible to use the apply function in base R, but that's not taking advantage of the data.table framework. Also, it doesn't generalize because there are cases which will require more conditional checking.

apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)

# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
#  [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333

In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:

dt[i = TRUE,
   q1_AVG := round(apply(.SD, 1, mean), 2), 
   .SDcols = grep("^q1", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77

The case with the weighted average can be calculated using matrix multiplication;

dt[i = TRUE,
   q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2), 
   .SDcols = grep("^q2", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77    0.80
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73    0.78
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74    0.82
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72    0.87
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75    0.77
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78    0.80
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73    0.80
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75    0.78
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76    0.77
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77    0.82
Ephialtes answered 26/10, 2015 at 18:28 Comment(1)
I think if you use rowMeans on matrix and then the %*% on that matrix, would be efficient. In the code you showed, apply with MARGIN=1 is used, which is not that fast. Also, by converting .SD to matrix, it is not using the data.table techniques.Dominate

© 2022 - 2024 — McMap. All rights reserved.