data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?
Here is an example showing 10 minutes reduced to 1 second (from NEWS on the homepage). It's like subassigning to a data.frame, but it doesn't copy the entire table each time.
library(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
#    user  system elapsed
# 287.062 302.627 591.984
system.time(for (i in 1:1000) DT[i,V1:=i])
#    user  system elapsed
#   1.148   0.000   1.158   (511 times faster)
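To make the "doesn't copy" point concrete, here is a minimal sketch that checks it directly. The table and the names alias, snapshot, before and after are made up for illustration; address() and copy() are data.table functions. It assumes an update to an existing column, which is the case timed above.

```r
library(data.table)

DT <- data.table(id = 1:5, v = c(10, 20, 30, 40, 50))

alias    <- DT        # a second name for the same object, not a copy
snapshot <- copy(DT)  # an explicit deep copy

before <- address(DT) # memory address of the table before the update
DT[2, v := 99]        # sub-assign by reference
after  <- address(DT) # memory address after the update

identical(before, after)  # the table was modified in place
alias$v[2]                # the alias sees the new value
snapshot$v[2]             # the deep copy keeps the old value
```

The flip side of updating by reference is that a plain assignment such as alias <- DT does not protect the original; use copy() when an independent table is needed.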
Putting the := in j like that allows more idioms:
DT["a",done:=TRUE] # binary search for group 'a' and set a flag
DT[,newcol:=42] # add a new column by reference (no copy of existing data)
DT[,col:=NULL] # remove a column by reference
and:
DT[,newcol:=sum(v),by=group] # like a fast transform() by group
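The idioms above can be tried on a small table. This is only a sketch: the columns group and v and the new column total are invented for the example, and the key is set first so that the "a" lookup can use binary search.

```r
library(data.table)

DT <- data.table(group = c("a", "a", "b", "b", "c"), v = 1:5)
setkey(DT, group)                  # key on group so DT["a", ...] is a keyed lookup

DT["a", done := TRUE]              # flag rows of group 'a'; other rows get NA
DT[, newcol := 42]                 # add a column by reference
DT[, newcol := NULL]               # remove it again by reference
DT[, total := sum(v), by = group]  # per-group sum, repeated on every row of the group

print(DT)
```

Unlike an aggregation, the grouped := keeps every row of the table and fills the new column with each group's result.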
I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method; e.g., S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch, etc. So for use inside for loops there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set are that i must be row numbers (no binary search) and that you can't combine it with by. By making those restrictions, set can reduce the overhead dramatically.
system.time(for (i in 1:1000) set(DT,i,"V1",i))
user system elapsed
0.016 0.000 0.018
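As a further illustration of set inside a loop, here is a small sketch that replaces NA with 0 in every column. The table and the names na_rows and colVar are hypothetical; the call follows the set(x, i, j, value) form described in ?set, where i is row numbers and j is a column number or name.

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

# Loop over columns by position; i must be row numbers, so compute them with which()
for (j in seq_along(DT)) {
  na_rows <- which(is.na(DT[[j]]))
  set(DT, i = na_rows, j = j, value = 0)
}

# j can also be a column name, either literal ("a") or held in a variable
colVar <- "a"
set(DT, i = 1L, j = colVar, value = 100)

print(DT)
```

Because j is a plain column number or name rather than an expression evaluated inside the table, it is easy to drive from a loop variable.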
<- operator would make a great question. – Terresaterrestrial
set(DT, i, "V1", i) command (you must use parentheses) and the basic DT[, V1] (where you don't use parentheses)? – Lenrow
V1 in one but "V1" in the other? Or are you asking about ( vs [? – Augustusaugy
set() is a column name only (as defined in ?set). You might want this to be literal (e.g. "V1") or held in a variable (e.g. colName, which may then contain "V1", "colA" or another column's name). The second argument inside DT[,] is always an expression evaluated within the scope of the data.table. DT[,V1] is the simplest case, but things like DT[,V1*V2] and DT[,sum(V1)] are more common. Does that help? – Augustusaugy
DT[,"V1"] returns simply "V1". This is explained by the very first FAQ, 1.1. There is no point to DT[,"V1"], really. The behaviour is like that for consistency (i.e. the 2nd argument is always evaluated within the scope of the data.table, even in this case), which users requested. It soon becomes natural to use DT[,V1] instead. – Augustusaugy
set(DT, i, "V1", i). – Lenrow
set(DT, i, "V1", i) sets the "V1" column, whilst set(DT, i, colVar, i) sets the column whose name is contained in the colVar variable (e.g. if colVar = "V1" was done earlier). The quotes indicate that the column name is taken literally rather than looked up as a variable. –
Augustusaugy