When should I use the := operator in data.table?

data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?

Terresaterrestrial answered 11/8, 2011 at 17:1 Comment(0)

Here is an example showing 10 minutes reduced to 1 second (from NEWS on homepage). It's like subassigning to a data.frame but doesn't copy the entire table each time.

m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)

system.time(for (i in 1:1000) DF[i,1] <- i)
     user  system elapsed 
  287.062 302.627 591.984 

system.time(for (i in 1:1000) DT[i,V1:=i])
     user  system elapsed 
    1.148   0.000   1.158     ( 511 times faster )

Putting := in j like that allows more idioms:

DT["a",done:=TRUE]   # binary search for group 'a' and set a flag
DT[,newcol:=42]      # add a new column by reference (no copy of existing data)
DT[,col:=NULL]       # remove a column by reference

and :

DT[,newcol:=sum(v),by=group]  # like a fast transform() by group
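
As a rough, self-contained sketch of the idioms above on a toy table: the column names `group`, `v`, `done` and `total` are invented for illustration, and the key is set explicitly so that the `DT["a", ...]` lookup has something to search on.

```r
library(data.table)

# Toy table; all column names here are made up for this sketch
DT <- data.table(group = c("a", "a", "b"), v = c(1, 2, 10))
setkey(DT, group)                    # DT["a", ...] looks up "a" in the key

DT["a", done := TRUE]                # flag rows in group 'a'; other rows get NA
DT[, total := sum(v), by = group]    # per-group sum, repeated on each row of the group
DT[, done := NULL]                   # remove the flag column again, by reference

print(DT)
```

None of these calls needs `DT <- ...` on the left-hand side, because each one modifies `DT` in place.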

I can't think of any reason to avoid :=, other than inside a for loop. Because := appears inside DT[...], it carries the small overhead of the [.data.table method, e.g. S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch, etc. So for use inside for loops there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set are that i must be row numbers (no binary search) and that you can't combine it with by. By imposing those restrictions, set can reduce the overhead dramatically.

system.time(for (i in 1:1000) set(DT,i,"V1",i))
     user  system elapsed 
    0.016   0.000   0.018
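
A minimal sketch of set() in a loop, on a small table invented for this example. It also shows that the column argument can be either a literal string or a variable holding the name (`colVar` is a hypothetical variable, not part of data.table).

```r
library(data.table)

# Small illustrative table; in the benchmark above DT has 100 columns V1..V100
DT <- data.table(V1 = c(0, 0, 0), V2 = c(0, 0, 0))

colVar <- "V2"                         # hypothetical variable holding a column name
for (i in 1:3) {
  set(DT, i, "V1", as.numeric(i))      # literal column name, row number i
  set(DT, i, colVar, i * 10)           # column name taken from colVar
}

print(DT)
```

As with :=, set() updates `DT` by reference, so the result is not assigned back to `DT`.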
Augustusaugy answered 11/8, 2011 at 17:18 Comment(15)
Thanks for developing this package. I have a feeling I'm going to be revising a lot of my code to use this package. - Skimmer
Great. Happy to help further. People who revise their code to use data.table often find the amount of code collapses down considerably (easier to debug and maintain); see the reviews on Crantastic. I would have loved to fix <- in R (and other things) so you wouldn't need to change any code, and I have posted to r-devel, but I can't see a way to make an omelette without breaking some eggs (sorry!) - Augustusaugy
@Matthew Overloading the <- operator would make a great question. - Terresaterrestrial
@gsk3 If the question is why I didn't do that, yes that's a great question. You ask, I'll answer :) - Augustusaugy
On chat I was asked to self ask/answer (which apparently is encouraged) - that question is here - Augustusaugy
@MatthewDowle Want to include an explanation of when not to use := and to use set() instead? - Terresaterrestrial
@Ari Just saw your comment, not sure how I missed it. Good idea - now added. - Augustusaugy
@MatthewDowle I'd +1 again if I could. - Terresaterrestrial
@MattDowle Why the difference in parentheses-use for referencing a column name between the set(DT, i, "V1", i) command (you must use parentheses) and the basic DT[, V1] (where you don't use parentheses)? - Lenrow
@jabberwocky Where you say "parentheses", did you mean "quotes"? i.e. why V1 in one but "V1" in the other? Or are you asking about ( vs [? - Augustusaugy
@MattDowle My mistake, sorry. I mean quotes. - Lenrow
@jabberwocky No worries, ok let's see. The 3rd argument of set() is a column name only (as defined in ?set). You might want this to be literal (e.g. "V1") or held in a variable (e.g. colName, which may then contain "V1", "colA" or another column's name). The second argument inside DT[,] is always an expression evaluated within the scope of the data.table. DT[,V1] is the simplest case, but things like DT[,V1*V2] and DT[,sum(V1)] are more common. Does that help? - Augustusaugy
@jabberwocky It may help to consider that DT[,"V1"] simply returns "V1". This is explained by the very first FAQ, 1.1. There is really no point in DT[,"V1"]. The behaviour is like that for consistency (i.e. the 2nd argument is always evaluated within the scope of the data.table, even in this case), which users requested. It soon becomes natural to use DT[,V1] instead. - Augustusaugy
@MattDowle Sorry for not being clear. I've read the FAQ and understand why I don't use the quotes (even though I understand even more now after your explanation!). I meant for my question to be about why you do use quotes in set(DT, i, "V1", i). - Lenrow
@jabberwocky No problem. set(DT, i, "V1", i) sets the "V1" column, whilst set(DT, i, colVar, i) sets the column whose name is contained in the colVar variable (e.g. if colVar = "V1" was done earlier). The quotes indicate that the column name is to be taken literally rather than looked up as a variable. - Augustusaugy
