Overview
I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:
1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions
4. There are some minor functionality differences (see "Examples/Usage" below)
In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
Question
What I want to know is:
- Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
- Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?
One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):
dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored a single statement over the strictly most optimal solution):
setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by = list(id, job)
]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. it doesn't use some of the more esoteric tricks).
Ideally what I'd like to see are some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.
Examples
Usage
- dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question). See the first sketch after this list.
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins; rolling joins are sketched below.
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark, plus the indexing sketch below.
- dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge); the last sketch after this list shows both sides.
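To make the first point concrete, here is a minimal sketch contrasting the two; the toy data, column names, and the "two smallest rows per group" task are my own illustration, not part of the question:

library(data.table)
library(dplyr)

DT <- data.table(g = c("a", "a", "a", "b", "b"), v = c(3, 1, 2, 5, 4))

# data.table: j may return any number of rows per group,
# e.g. the two smallest values of v within each group
DT[, .SD[order(v)][seq_len(min(.N, 2))], by = g]

# dplyr (at the time of the question): do() was the usual work-around
# for group-wise results with an arbitrary number of rows
as_tibble(DT) %>% group_by(g) %>% do(head(arrange(., v), 2))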
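For the rolling-join point, a rough sketch; the quotes/trades framing and column names are illustrative, not taken from the question:

library(data.table)

quotes <- data.table(time = c(1, 5, 10), bid = c(100, 101, 102))
trades <- data.table(time = c(2, 6, 11), qty = c(10, 20, 30))
setkey(quotes, time)
setkey(trades, time)

# for each trade, carry the most recent quote at or before its time forward
quotes[trades, roll = TRUE]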
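For the automatic-indexing point, a sketch with arbitrary (assumed) data sizes; the subset is written in plain base R style and data.table handles the indexing behind the scenes:

library(data.table)

DT <- data.table(col = sample(letters, 1e7, replace = TRUE), x = rnorm(1e7))

DT[col == "a"]            # first call builds an index on 'col'
DT[col == "a"]            # repeat calls use binary search on that index
DT[col %in% c("a", "b")]  # %in% subsets are optimised the same way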
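Finally, a rough sketch of the programmatic-use point. The regroup/summarize_each_ verbs mentioned above belong to the dplyr of that era; current dplyr handles this with tidy evaluation instead, and data.table can use a character by plus get(), so treat the exact calls below as illustrative rather than canonical:

library(data.table)
library(dplyr)

DT  <- data.table(g = c("a", "a", "b"), v = 1:3)
grp <- "g"   # column names arriving as strings, e.g. from a function argument
val <- "v"

# data.table: character 'by' plus get() (or build the j call and eval() it)
DT[, .(total = sum(get(val))), by = grp]

# dplyr (current tidy-eval style, not the old SE verbs named above)
as_tibble(DT) %>%
  group_by(across(all_of(grp))) %>%
  summarise(total = sum(.data[[val]]))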
Benchmarks
- I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
- @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
- (Unverified) has data.table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
- An older benchmark on 80K groups has data.table ~8x faster.
Data
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane",
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob",
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L,
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager",
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager",
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id",
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA,
-16L))