Break data.table chain into two lines of code for readability
I'm working on an R Markdown document and was told to strictly limit lines to a maximum of 100 columns (margin column). In the document's code chunks I use many different packages, among which is data.table.

In order to comply with the limit I can split chains (and even long commands) like:

p <- ggplot(foo, aes(bar, foo2)) +
       geom_line() +
       stat_smooth()
bar <- sum(long_variable_name_here,
           na.rm = TRUE)
foo <- bar %>%
         group_by(var) %>%
         summarize(var2 = sum(foo2))

but I can't split a data.table chain, as it produces an error. How can I achieve something like this?

bar <- foo[,.(long_name_here=sum(foo2)),by=var]
           [order(-long_name_here)]

The last line, of course, causes an error. Thanks!

Rowlett answered 17/11, 2015 at 16:17 Comment(2)
Lots of ways to do this; the key, as noted by @Jaap, is to carry over your closing ]. From there, it's up to personal taste exactly how you'd like to slice and dice. – Playwriting
@Stefano your link points to this very question. You probably pasted the wrong link? – Rowlett
You have to place the line break between the [ and ] of a step, so that the closing ] carries over to start the next line. An example of how to divide your data.table code over several lines:

bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)]

You can also break before or after each comma. An example with the break before the comma (my preference):

bar <- foo[, .(long_name_here = sum(foo2))
           , by = var
           ][order(-long_name_here)
             , long_name_2 := long_name_here * 10]

See this answer for an extended example.
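To make the pattern concrete, here is a self-contained sketch; the sample data and the column names foo2 and var are made up for illustration:

```r
library(data.table)

# hypothetical sample data
foo <- data.table(var = c("a", "a", "b"), foo2 = c(1, 2, 3))

# The break goes inside the brackets: the closing ] moves down to
# start the next line, so the parser keeps reading the chain as a
# single expression instead of stopping after the first step.
bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)]
print(bar)
```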

Lachrymator answered 17/11, 2015 at 16:18 Comment(0)
Chaining data.tables with magrittr

Here's a method I use with magrittr: piping into the . placeholder and subsetting it with [:

library(magrittr)
library(data.table)

bar <- foo %>%
        .[etcetera] %>%
        .[etcetera] %>%
        .[etcetera]

A working example:

out <- data.table(expand.grid(x = 1:10,y = 1:10))
out %>% 
  .[,z := x*y] %>% 
  .[,w := x*z] %>% 
  .[,v := w*z]
print(out)

Additional examples

Edit: it's also not just syntactic sugar, since it lets you refer to the table from the previous step as ., which means that you can do a self join,

or you can use %T>% for some logging in-between steps (using futile.logger or the like):

out %>%
 .[etcetera] %>%
 .[etcetera] %T>% 
 .[loggingstep] %>%
 .[etcetera] %>%
 .[., on = SOMEVARS, allow.cartesian = TRUE]
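A runnable sketch of that idea, using made-up data and a plain print() standing in for a real logger such as futile.logger:

```r
library(data.table)
library(magrittr)  # provides %>% and %T>%

dt <- data.table(grp = c("a", "a", "b"), val = c(10, 20, 30))

res <- dt %>%
  .[, total := sum(val), by = grp] %T>%    # %T>% passes dt along unchanged...
  .[, print(.N)] %>%                       # ...so this step only logs a row count
  .[., on = "grp", allow.cartesian = TRUE] # self join: both dots are the previous step
print(res)
```

The tee pipe %T>% runs its right-hand side for the side effect only and forwards its left-hand side, which is what makes the logging step safe to drop into the middle of a chain.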

EDIT:

This is much later, and I still use this regularly. But I have the following caveat:

magrittr adds overhead

I really like doing this at the top level of a script. It has a very clear and readable flow, and there are a number of neat tricks you can do with it.

But I've had to remove it when optimizing, if it's part of a function that's being called many times.

You're better off chaining data.tables the old-fashioned way in that case.

EDIT 2: Well, I'm back here to say that it doesn't add much overhead after all. I tried benchmarking it on a few tests and can't find any major difference:

library(magrittr)
library(data.table)
toplevel <- data.table::CJ(group = 1:100, sim = 1:100, letter = letters)
toplevel[, data := runif(.N)]

processing_method1 <- function(dt) {
  dt %>% 
    .[, mean(data), by = .(letter)] %>%
    .[, median(V1)]
}

processing_method2 <- function(dt) {
  dt[, mean(data), by = .(letter)][, median(V1)]
}

microbenchmark::microbenchmark(
  with_pipe = toplevel[, processing_method1(.SD), by = group],
  without_pipe = toplevel[, processing_method2(.SD), by = group]
)
Unit: milliseconds
         expr      min       lq      mean   median       uq      max neval
    with_pipe 87.18837 91.91548 101.96456 100.7990 106.2750 230.5221   100
 without_pipe 86.81728 90.74838  98.43311  99.2259 104.6146 129.8175   100

Almost no overhead here.

Katheykathi answered 26/4, 2016 at 19:8 Comment(4)
An advantage of this approach is that it's easy to run a subset of your chain when testing. – Hoarding
How much overhead does the chain add? – Hoarding
I meant to come back here and edit the post with a microbenchmark test, but I can't actually reproduce the overhead with new versions of the package. – Katheykathi
You can also use the fastpipe package with this approach, works great. – Hoarding
For many years, the way automatic indentation in RStudio misaligns data.table pipes has been a source of frustration to me. I only recently realized that there is a neat way to get around this: simply enclose the piped operations in parentheses.

Here's a simple example:

x <- data.table(a = letters, b = rep(LETTERS[1:5], length.out = 26), c = rnorm(26))
y <- (
  x
  [, c := round(c, 2)]
  [sample(26)]
  [, d := paste(a, b)]
  [, .(d, foo = mean(c)), by = b]
  )

Why does this work? Because the unclosed parenthesis signals to the R interpreter that the current expression is still incomplete, so the whole pipe is treated the same way as one continuous line of code.
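For contrast, a minimal sketch of both forms, on a hypothetical one-column table; the commented-out version fails because each line already parses as a complete statement:

```r
library(data.table)

x <- data.table(a = 1:3)

# Without parentheses this errors: `x` on its own line is a complete
# expression, so the next line's `[, b := a * 2]` is parsed in isolation.
# y <- x
#   [, b := a * 2]

# Wrapped in parentheses, the interpreter keeps reading until the
# closing ), so the whole pipe is treated as one expression.
y <- (
  x
  [, b := a * 2]
  [a > 1]
)
print(y)
```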

Foreign answered 9/4, 2022 at 23:29 Comment(6)
Nice addition, +1! – Lachrymator
Phenomenal, this is a useful and time-saving tip. – Nuthouse
How could you avoid modifying x? – Hoarding
@SimonWoodward – the := operator is a data.table feature (one of its most useful) that is specifically designed to update x. It will do that whether you chain operations together or not, and whether you use this format or not. So your question is not really related to this Q&A, but is more general. To avoid updating x, you could (1) not use the := operator, or (2) begin the pipe with copy(x). – Foreign
In this situation x is often a big data.table (e.g. 1 GB) that I do not want to mess with (or copy if I can help it). So I just use = instead of :=? – Hoarding
@SimonWoodward I think you need to post this as a new question, providing a minimal reproducible example and explaining exactly what you're trying to achieve. Comments here are not the right place to ask this. Once you post your new question, please let me know, so we can both delete this comment thread. – Foreign

© 2022 - 2024 — McMap. All rights reserved.