data.table do not compute NA groups in by

This question has a partial answer here but the question is too specific and I'm not able to apply it to my own problem.

I would like to skip a potentially heavy computation of the NA group when using by.

library(data.table)

DT = data.table(X = sample(10), 
                Y = sample(10), 
                g1 = sample(letters[1:2], 10, TRUE),
                g2 = sample(letters[1:2], 10, TRUE))

set(DT, 1L, 3L, NA)
set(DT, 1L, 4L, NA)
set(DT, 6L, 3L, NA)
set(DT, 6L, 4L, NA)

DT[, mean(X*Y), by = .(g1,g2)]

Here there can be up to 5 groups, including the (NA, NA) group. Given that (i) this group is useless, (ii) the groups can be very large, and (iii) the actual computation is more complex than mean(X*Y), can I skip that group efficiently, that is, without creating a copy of the remaining table? The following works, but it copies the table:

DT2 = na.omit(DT, cols = c("g1", "g2"))  # dispatches to the data.table method; returns a copy without the NA rows
DT2[, mean(X*Y), by = .(g1,g2)]
Stratocracy answered 19/3, 2018 at 15:35 Comment(1)
Near-duplicate for single-variable 'by' case - Mcandrew

You can use an if clause:

DT[, if (!anyNA(.BY)) mean(X*Y), by = .(g1,g2)]

   g1 g2       V1
1:  b  a 25.75000
2:  a  b 24.00000
3:  b  b 35.33333

From the ?.BY help:

.BY is a list containing a length 1 vector for each item in by. This can be useful [...] to branch with if() depending on the value of a group variable.
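To make the effect of the if clause concrete, here is a minimal sketch on a small hand-built table (the values are illustrative assumptions, not the data from the question). When j evaluates to NULL for a group, that group contributes no row to the result, so the output should match what the na.omit approach from the question returns, without building a filtered copy first.

```r
library(data.table)

# Illustrative data (assumed for this sketch): rows 1 and 6 have NA in both grouping columns
DT <- data.table(
  X  = 1:10,
  Y  = 10:1,
  g1 = c(NA, "a", "a", "b", "b", NA, "a", "b", "a", "b"),
  g2 = c(NA, "a", "b", "a", "b", NA, "a", "b", "b", "a")
)

# .BY holds the current group's key values; the if() has no else branch,
# so j returns NULL for any group with an NA key and that group is dropped
skipped <- DT[, if (!anyNA(.BY)) mean(X * Y), by = .(g1, g2)]

# Reference result: filter first (this makes a copy), then aggregate
reference <- na.omit(DT, cols = c("g1", "g2"))[, mean(X * Y), by = .(g1, g2)]

print(skipped)
print(all.equal(skipped, reference))
```

Note that anyNA(.BY) skips a group as soon as any one of its keys is NA, which matches na.omit with cols = c("g1", "g2"); to drop only groups where every key is NA, the condition would need to change accordingly.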

Leuco answered 19/3, 2018 at 15:43 Comment(7)
I think it would be nice to have syntax like DT[, mean(X*Y), by=.(g1,g2), having=!anyNA(.BY)], requested here: github.com/Rdatatable/data.table/issues/788 - Leuco
I was about to post DT[rowSums(is.na(DT[, .(g1,g2)])) == 0, mean(X*Y), by = .(g1,g2)], but this is much faster. - Caterina
@Leuco I tried but couldn't understand the role of .BY here. Please help me understand. - Rhumb
@Manish You can try DT[, {print(.BY); cat("next group...\n")}, by=.(g1, g2)] to get intuition for what it is. If that doesn't clear it up, I've invited you to a chat room to follow up. - Leuco
Works fine, thank you. Since data.table recycles the memory allocated for the biggest group, I guess this does not change anything in the memory allocation; it just skips the computation. Fair enough :-) - Stratocracy
That's quite brilliant; it should be documented more widely in recipes and cheat sheets. - Mcandrew
@Mcandrew Thanks :) Fwiw, here's my cheat sheet including .BY: franknarf1.github.io/r-tutorial/_book/… (though it does not cover the j = if (...) ... idiom). - Leuco
