How to do row wise operations on .SD columns in data.table

Asked 26/10, 2015 at 18:28 Answered 27/10, 2015 at 17:0

Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...

I want to do row wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, lapply, or if the belong inside the brackets of .SD.

As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?

Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)

library(data.table)

set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))

dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89

Ephialtes answered 26/10, 2015 at 18:28 Comment(10)

If you want to do rowwise operations in data.table, you can create an index variable with .I and use that in the by = part. – Arlie 26/10, 2015 at 18:31

If you're concerned about efficiency, eddi's answer to my question here is quite good: https://mcmap.net/q/1176183/-efficiently-computing-a-linear-combination-of-data-table-columns (and, I think, relevant) – Baskin 26/10, 2015 at 18:35

I' guessing that you could gain efficiency and clarity either by having your data as "data.frame" with columns like ["id" "quarter" "exam"] and use database operations or by having a "list" (depending on a convenient factor) of "matrices" and use more 'arithmetic' operations. – Leucite 26/10, 2015 at 18:53

@Arlie Do you mean use .I like this: #16574495 ? (I didn't initially realize this example might be relevant, thanks Frank for showing me this on github) – Ephialtes 26/10, 2015 at 19:4

Use by = 1:nrow(dt). Or depending on the operation, you might be able to use Reduce() in j. – Skyway 26/10, 2015 at 19:10

Sorry, voting to close as "primarily opinion based". If it gets closed before you've found a satisfactory answer, you might consider posting a different question (or editing this one). There is also a package mailing list -- not the github project site itself, but linked from the wiki under "discussion": github.com/Rdatatable/data.table/wiki/Getting-started – Baskin 26/10, 2015 at 19:34

@frank How is this question opinion based? – Ephialtes 27/10, 2015 at 13:16

It's opinion-based, in stackoverflow parlance, because there are many possible answers to "how would you do this differently?" Usually, it would be recommended that you post this sort of thing on codereview.stackexchange.com/tour but I mention the mailing list because I have no experience with the code review site. – Baskin 27/10, 2015 at 15:4

@Baskin Ok, I edited the question – Ephialtes 27/10, 2015 at 16:36

Ok, thanks. That works, I think. – Baskin 27/10, 2015 at 17:1

Here are a few thoughts on your options, largely gathered from the comments:

apply along rows

The OP's approach uses apply(.,1,.) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply also are not suitable, since they are designed to work on each columns separately, not to combine them.

rowMeans and similarly-named functions also coerce to a matrix.

Split by rows

As @Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.

Efficiently create new columns

This approach taken from eddi is probably the most efficient if you must keep your data in wide format:

jwts = list( 
  q1_AVG  = c(q1_exam1 = 1  , q1_exam2 = 1  , q1_final =   1)/3, 
  q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)


for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}

This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).

Go long

As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like

# reshape
m = melt(dt, id.vars="id", value.name="score")[,
  c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]

# input your weighting rules
w = unique(m[,c("quarter","exam")])
w[quarter=="q1"                , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]

# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]

This is what I would do.

In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.

Baskin answered 27/10, 2015 at 17:0 Comment(2)

The logic in "Go long" makes a lot of sense (and will work better for my work). At the risk of sounding opinion based, I'd say this seems like the "right" way to think about it. In fact, this sounds like the "right" way to think through the problem. – Ephialtes 27/10, 2015 at 17:17

Also, it's a good example to study to show the value of the new tstrsplit and on features. – Ephialtes 27/10, 2015 at 17:19

In this case it is possible to use the apply function in base R, but that's not taking advantage of the data.table framework. Also, it doesn't generalize because there are cases which will require more conditional checking.

apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)

# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
#  [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333

In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:

dt[i = TRUE,
   q1_AVG := round(apply(.SD, 1, mean), 2), 
   .SDcols = grep("^q1", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77

The case with the weighted average can be calculated using matrix multiplication;

dt[i = TRUE,
   q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2), 
   .SDcols = grep("^q2", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77    0.80
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73    0.78
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74    0.82
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72    0.87
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75    0.77
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78    0.80
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73    0.80
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75    0.78
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76    0.77
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77    0.82

Ephialtes answered 26/10, 2015 at 18:28 Comment(1)

I think if you use rowMeans on matrix and then the %*% on that matrix, would be efficient. In the code you showed, apply with MARGIN=1 is used, which is not that fast. Also, by converting .SD to matrix, it is not using the data.table techniques. – Dominate 26/10, 2015 at 18:31

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags