dplyr on data.table, am I really using data.table?

If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr syntax? Or do I need to use pure data.table syntax to harness all of its power?

Code Example:

library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut) 

diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price),
              MedianPrice = as.numeric(median(price)),
              Count = n()) %>%
    arrange(desc(Count))

Results:

#         cut AvgPrice MedianPrice Count
# 1     Ideal 3457.542      1810.0 21551
# 2   Premium 4584.258      3185.0 13791
# 3 Very Good 3981.760      2648.0 12082
# 4      Good 3928.864      3050.5  4906

Here is the data.table equivalent I came up with. I'm not sure whether it follows data.table good practice, but I wonder if the code is really more efficient than the dplyr syntax behind the scenes:

diamondsDT[cut != "Fair"
           ][, .(AvgPrice = mean(price),
                 MedianPrice = as.numeric(median(price)),
                 Count = .N), by = cut
           ][order(-Count)]
Backandforth answered 16/12, 2014 at 18:35 Comment(5)
Why wouldn't you use data.table syntax? It's elegant and efficient as well. The question is not really answerable as it is very broad. Yes, there are dplyr methods for data.tables, but data.table also has its own comparable methods. – Kelliekellina
I can use data.table syntax, of course. But somehow I find the dplyr syntax more elegant. Regardless of my preference for the syntax, what I really want to know is: do I need to use pure data.table syntax in order to get 100% of the benefits of data.table's power? – Backandforth
For a recent benchmark where dplyr is used on data.frames and corresponding data.tables, see here (and references therein). – Zonnya
@Backandforth - I would think the answer to that question is definitely "Yes". – Kelliekellina
@Henrik: I realized later that I had misinterpreted that page because they only displayed the code for the data.frame construction but not the code they used for the data.table construction. When I realized it, I deleted my comment (hoping you hadn't seen it). – Affiche

There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.

Operations involving i (== filter() and slice() in dplyr)

Assume DT with say 10 columns. Consider these data.table expressions:

DT[a > 1, .N]                    ## --- (1)
DT[a > 1, mean(b), by=.(c, d)]   ## --- (2)

(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).

Commonly used dplyr expressions would be:

DT %>% filter(a > 1) %>% summarise(n())                        ## --- (3) 
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)

Clearly, the data.table code is shorter. In addition, it is also more memory efficient¹. Why? Because in both (3) and (4), filter() first returns rows for all 10 columns, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the subsequent operations. To overcome this, we have to select() columns a priori:

DT %>% select(a) %>% filter(a > 1) %>% summarise(n()) ## --- (5)
DT %>% select(a,b,c,d) %>% filter(a > 1) %>% group_by(c,d) %>% summarise(mean(b)) ## --- (6)

It is essential to highlight a major philosophical difference between the two packages:

  • In data.table, we like to keep these related operations together; that allows us to look at the j-expression (from the same function call) and realise that no columns are needed in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never realised. In (2), just columns b, c, d are materialised in the subset; other columns are ignored.

  • But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell whether the operation after filter() needs all the columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.

Note that in (5) and (6), we still subset column a, which we don't require. I'm not sure how to avoid that. If the filter() function had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).

Sub-assign by reference

dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.

For example, in data.table you can do:

DT[a %in% some_vals, a := NA]

which updates column a by reference on just those rows that satisfy the condition. At the moment dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.

But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr: FR#614. Note that, even then, the column you modify will always be copied (therefore a tad slower / less memory efficient). There will be no way to update columns by reference.
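
To make the contrast concrete, here is a minimal sketch with made-up column names and values; the mutate()/replace() call is just one of several ways to express the same update in dplyr:

library(data.table)
library(dplyr)

DT <- data.table(a = 1:5, b = letters[1:5])
some_vals <- c(2L, 4L)

## dplyr: returns a modified copy; the original DT is left untouched
DT2 <- DT %>% mutate(a = replace(a, a %in% some_vals, NA_integer_))

## data.table: sub-assigns NA to column a only on the matching rows,
## modifying DT in place -- no copy of the table is made
DT[a %in% some_vals, a := NA_integer_]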

Other functionalities

  • In data.table, you can aggregate while joining, which is more straightforward to understand and memory efficient, since the intermediate join result is never materialised. Check this post for an example (and see the sketch after this list). You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.

  • data.table's rolling joins feature is not supported in dplyr's syntax either.

  • We recently implemented overlap joins in data.table to join over interval ranges (here's an example). This is a separate function, foverlaps(), at the moment, and therefore it could be used with the pipe operators (magrittr / pipeR? - I've never tried it myself).

    But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining, etc., which will have the same limitations outlined above.

  • Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets with regular R syntax. For example, DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column to subset quickly using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature.

    From the way filter() is implemented for data.tables, it doesn't take advantage of this feature (a small sketch of the auto-indexing behaviour also follows this list).

  • A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
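
As a small sketch of aggregating while joining (table and column names are made up for illustration), by = .EACHI evaluates j once per row of the i table, so the full join result is never materialised:

library(data.table)

sales  <- data.table(id = c(1L, 1L, 2L, 3L), amount = c(10, 20, 5, 7))
lookup <- data.table(id = 1:2)

setkey(sales, id)
## one aggregated row per row of lookup; the intermediate join is never built
sales[lookup, .(total = sum(amount)), by = .EACHI]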
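
And a minimal illustration of the automatic indexing behaviour (column names and sizes made up; exact timings will vary):

library(data.table)

DT <- data.table(x = sample(1e6, 1e7, TRUE), y = runif(1e7))

system.time(DT[x == 42L])          # first run: builds an index on x, then binary search
system.time(DT[x == 42L])          # later runs reuse the index and are much faster
system.time(DT[x %in% c(1L, 7L)])  # %in% subsets on x can reuse the same index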

So you will have to weigh these (and probably other) points and decide whether these trade-offs are acceptable to you.

HTH


¹ Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving the data from main memory into cache (and making use of the data in cache as much as possible - reducing cache misses - so as to reduce trips to main memory). Not going into details here.

Githens answered 17/12, 2014 at 7:54 Comment(2)
That's a good answer, but it would be possible (if not likely) for dplyr to implement an efficient filter() plus summarise() using the same approach that dplyr uses for SQL - i.e. building up an expression and then only executing it once, on demand. It's unlikely this will be implemented in the near future because dplyr is fast enough for me, and implementing a query planner/optimiser is relatively hard. – Shericesheridan
Being memory efficient also helps in another important area - actually completing the task before running out of memory. When working with large datasets I've faced that issue with dplyr as well as pandas, whereas data.table would complete the job gracefully. – Xever

Just try it.

library(rbenchmark)
library(dplyr)
library(data.table)

# diamondsDT as constructed in the question
diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut)

benchmark(
  dplyr = diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price),
              MedianPrice = as.numeric(median(price)),
              Count = n()) %>%
    arrange(desc(Count)),
  data.table = diamondsDT[cut != "Fair",
                          list(AvgPrice = mean(price),
                               MedianPrice = as.numeric(median(price)),
                               Count = .N), by = cut][order(-Count)])[1:4]

On this problem it seems data.table is 2.4x faster than dplyr using data.table:

        test replications elapsed relative
2 data.table          100    2.39    1.000
1      dplyr          100    5.77    2.414

Revised based on Polymerase's comment.

Sextan answered 16/12, 2014 at 18:45 Comment(6)
Using the microbenchmark package, I found that running the OP's dplyr code on the original (data frame) version of diamonds took a median time of 0.012 seconds, while it took a median time of 0.024 seconds after converting diamonds to a data.table. Running G. Grothendieck's data.table code took 0.013 seconds. At least on my system, it looks like dplyr and data.table have about the same performance. But why would dplyr be slower when the data frame is first converted to a data.table? – Mckinley
Dear G. Grothendieck, this is wonderful. Thank you for showing me this benchmark utility. BTW you forgot [order(-Count)] in the data.table version to match dplyr's arrange(desc(Count)). After adding this, data.table is still faster, by about 1.8x (instead of 2.9). – Backandforth
@Mckinley can you rerun your benchmark with the data.table version here (sorting by descending Count added at the last step): diamondsDT[cut != "Fair", list(AvgPrice = mean(price), MedianPrice = as.numeric(median(price)), Count = .N), by = cut][order(-Count)] – Backandforth
Still 0.013 seconds. The ordering operation takes hardly any time because it's just reordering the final table, which has only four rows. – Mckinley
There's some fixed overhead for the conversion from dplyr syntax to data.table syntax, so it might be worth trying varying problem sizes. Also, I may not have implemented the most efficient data.table code in dplyr; patches are always welcome. – Shericesheridan
@hadley, see my answer; does that sound like it could be the issue? I'm not sure this is something easily patchable though. – Parve

To answer your questions:

  • Yes, you are using data.table
  • But not as efficiently as you would with pure data.table syntax

In many cases this will be an acceptable compromise for those wanting the dplyr syntax, though it will possibly be slower than dplyr with plain data frames.

One big factor appears to be that dplyr will copy the data.table by default when grouping. Consider (using microbenchmark):

Unit: microseconds
                                                               expr       min         lq    median
                                diamondsDT[, mean(price), by = cut]  3395.753  4039.5700  4543.594
                                          diamondsDT[cut != "Fair"] 12315.943 15460.1055 16383.738
 diamondsDT %>% group_by(cut) %>% summarize(AvgPrice = mean(price))  9210.670 11486.7530 12994.073
                               diamondsDT %>% filter(cut != "Fair") 13003.878 15897.5310 17032.609
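
For reference, a call along these lines (reconstructed from the expressions in the output above, with diamondsDT as defined in the question) produces that kind of comparison:

library(microbenchmark)
library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)

microbenchmark(
  diamondsDT[, mean(price), by = cut],
  diamondsDT[cut != "Fair"],
  diamondsDT %>% group_by(cut) %>% summarize(AvgPrice = mean(price)),
  diamondsDT %>% filter(cut != "Fair")
)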

The filtering is of comparable speed, but the grouping isn't. I believe the culprit is this line in dplyr:::grouped_dt:

if (copy) {
    data <- data.table::copy(data)
}

where copy defaults to TRUE (and can't easily be changed to FALSE, as far as I can see). This probably doesn't account for 100% of the difference, but general overhead alone, on something the size of diamonds, most likely isn't the full difference either.

The issue is that, in order to have a consistent grammar, dplyr does the grouping in two steps. It first sets keys on a copy of the original data.table that match the groups, and only later does it group. data.table just allocates memory for the largest result group, which in this case is just one row, so that makes a big difference in how much memory needs to be allocated.

FYI, if anyone cares, I found this by using treeprof (install_github("brodieg/treeprof")), an experimental (and still very much alpha) tree viewer for Rprof output:

[treeprof screenshot of the Rprof call tree]

Note that the above currently only works on Macs, AFAIK. Also, unfortunately, Rprof records calls of the type packagename::funname as anonymous, so it could actually be any and all of the data.table:: calls inside grouped_dt that are responsible, but from quick testing it looked like data.table::copy is the big one.

That said, you can quickly see how there isn't that much overhead around the [.data.table call, but there is also a completely separate branch for the grouping.


EDIT: to confirm the copying:

> tracemem(diamondsDT)
[1] "<0x000000002747e348>"    
> diamondsDT %>% group_by(cut) %>% summarize(AvgPrice = mean(price))
tracemem[0x000000002747e348 -> 0x000000002a624bc0]: <Anonymous> grouped_dt group_by_.data.table group_by_ group_by <Anonymous> freduce _fseq eval eval withVisible %>% 
Source: local data table [5 x 2]

        cut AvgPrice
1      Fair 4358.758
2      Good 3928.864
3 Very Good 3981.760
4   Premium 4584.258
5     Ideal 3457.542
> diamondsDT[, mean(price), by = cut]
         cut       V1
1:     Ideal 3457.542
2:   Premium 4584.258
3:      Good 3928.864
4: Very Good 3981.760
5:      Fair 4358.758
> untracemem(diamondsDT)
Parve answered 16/12, 2014 at 21:7 Comment(4)
This is awesome, thanks. Does that mean dplyr::group_by() will double the memory requirement (compared to pure data.table syntax) because of the internal data copy step? Meaning, if my data.table object is 1 GB in size and I use the dplyr chained syntax similar to the one in the original post, will I need at least 2 GB of free memory to get the results? – Backandforth
I feel like I fixed that in the dev version? – Shericesheridan
@hadley, I was working from the CRAN version. Looking at dev, it looks like you partially addressed the issue, but the actual copy remains (haven't tested, just looking at lines c(20, 30:32) in R/grouped-dt.r). It's probably faster now, but I bet the slow step is the copy. – Parve
I'm also waiting for a shallow copy function in data.table; until then I think it's better to be safe than fast. – Shericesheridan

You can use dtplyr now, which is part of the tidyverse. It allows you to write dplyr-style statements as usual, but uses lazy evaluation and translates your statements to data.table code under the hood. The overhead of the translation is minimal, and you get most, if not all, of the benefits of data.table. More details are at the official git repo here and on the tidyverse page.
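
A rough sketch of how this looks with the question's query (assuming a recent dtplyr; show_query() prints the data.table code the pipeline translates to):

library(data.table)
library(dtplyr)
library(dplyr)

diamondsDT <- lazy_dt(data.table(ggplot2::diamonds))

query <- diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price), Count = n()) %>%
    arrange(desc(Count))

show_query(query)      # the generated data.table expression
as.data.table(query)   # nothing is computed until the result is collected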

Kaleena answered 7/11, 2019 at 17:38 Comment(0)
