data.table: do not compute NA groups in by

Asked 19/3, 2018 at 15:35 Answered 19/3, 2018 at 15:43

Solved r group-by data.table grouping na

This question has a partial answer here, but that question is too specific and I'm not able to apply it to my own problem.

I would like to skip a potentially heavy computation of the NA group when using by.

library(data.table)

DT = data.table(X = sample(10), 
                Y = sample(10), 
                g1 = sample(letters[1:2], 10, TRUE),
                g2 = sample(letters[1:2], 10, TRUE))

set(DT, 1L, 3L, NA)
set(DT, 1L, 4L, NA)
set(DT, 6L, 3L, NA)
set(DT, 6L, 4L, NA)

DT[, mean(X*Y), by = .(g1,g2)]

Here we can see there are up to 5 groups, including the (NA, NA) group. Considering that (i) this group is useless, (ii) the groups can be very big, and (iii) the actual computation is more complex than mean(X*Y), can I skip the group efficiently, that is, without creating a copy of the remaining table? The following works, but it creates such a copy:

DT2 = na.omit(DT, cols = c("g1", "g2"))
DT2[, mean(X*Y), by = .(g1,g2)]

Stratocracy answered 19/3, 2018 at 15:35 Comment(1)

Near-duplicate for single-variable 'by' case – Mcandrew 19/4, 2018 at 23:51

You can use an if clause:

DT[, if (!anyNA(.BY)) mean(X*Y), by = .(g1,g2)]

   g1 g2       V1
1:  b  a 25.75000
2:  a  b 24.00000
3:  b  b 35.33333

From the ?.BY help:

.BY is a list containing a length 1 vector for each item in by. This can be useful [...] to branch with if() depending on the value of a group variable.
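
To make the behaviour easier to check by hand, here is a minimal sketch on a small fixed table. The values below are made up for illustration and are not the question's random data. It relies on the same idea as above: when j returns NULL for a group (here because the if condition is FALSE and there is no else branch), that group is left out of the result.

```r
library(data.table)

# Illustrative data: two rows have NA in at least one grouping column
DT <- data.table(
  X  = 1:6,
  Y  = c(2, 4, 6, 8, 10, 12),
  g1 = c("a", "a", "b", "b", NA, NA),
  g2 = c("a", "b", "a", "b", NA, "a")
)

# .BY holds the current group's values of g1 and g2 (one length-1 item each);
# mean(X * Y) is only evaluated when none of them is NA
res <- DT[, if (!anyNA(.BY)) mean(X * Y), by = .(g1, g2)]
print(res)
```

Both the (NA, NA) group and the (NA, a) group are skipped, so only the four complete groups remain in res.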

Leuco answered 19/3, 2018 at 15:43 Comment(7)

I think it would be nice to have syntax like DT[, mean(X*Y), by=.(g1,g2), having=!anyNA(.BY)], requested here github.com/Rdatatable/data.table/issues/788 – Leuco 19/3, 2018 at 15:44

I was about to post DT[rowSums(is.na(DT[, .(g1,g2)])) == 0, mean(X*Y), by = .(g1,g2)], but this is much faster. – Caterina 19/3, 2018 at 15:53

@Leuco I tried but couldn't understand the role of .BY here. Please help me understand. – Rhumb 19/3, 2018 at 16:46

@Manish You can try DT[, {print(.BY); cat("next group...\n")}, by=.(g1, g2)] to get intuition for what it is. If that doesn't clear it up, I've invited you to a chat room to follow up – Leuco 19/3, 2018 at 16:50

Works fine, thank you. Considering that data.table recycles the memory allocated for the biggest group, I guess this does not change anything in the memory allocation. It just skips the computation. Fair enough :-) – Stratocracy 19/3, 2018 at 17:18

That's quite brilliant, it should be documented more widely in recipes and cheat-sheets. – Mcandrew 19/4, 2018 at 23:47

@Mcandrew Thanks :) Fwiw, here's my cheat sheet including .BY franknarf1.github.io/r-tutorial/_book/… though not the j = if (...) ... – Leuco 20/4, 2018 at 0:33
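
For comparison, the comments also mention filtering rows in i before grouping. The sketch below, again on made-up illustrative data, only checks that a simple row filter in i and the if (!anyNA(.BY)) form return the same groups and values. It makes no claim about their relative speed or memory use.

```r
library(data.table)

# Illustrative data: two rows have NA in at least one grouping column
DT <- data.table(
  X  = 1:6,
  Y  = c(2, 4, 6, 8, 10, 12),
  g1 = c("a", "a", "b", "b", NA, NA),
  g2 = c("a", "b", "a", "b", NA, "a")
)

# Skip NA groups inside j, using .BY
by_if <- DT[, if (!anyNA(.BY)) mean(X * Y), by = .(g1, g2)]

# Drop rows with NA grouping values in i, then group
by_i <- DT[!is.na(g1) & !is.na(g2), mean(X * Y), by = .(g1, g2)]

# Compare the two results
same <- isTRUE(all.equal(by_if, by_i))
print(same)
```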
