I am trying to better understand how to use keyed data.tables. After reading the documentation, I think I understand how to speed up subsetting when using one key. For example:
library(data.table)
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=c(1,3,6), v=1:9)
Option one:
DT[x == "ad"]
Option two:
setkey(DT,x)
DT["ad"]
In this case option one is much slower than option two, because the data.table uses the key to search more efficiently (a binary search instead of a vector scan; I do not fully understand the details, but I will trust that it is faster).
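To get a feel for the gap, here is a rough sketch that times both approaches on a larger table (the size and id values are invented for illustration):

set.seed(1)
big = data.table(x = sample(sprintf("id%05d", 1:10000), 1e7, replace = TRUE),
                 v = rnorm(1e7))
system.time(big[x == "id00001"])   # vector scan: compares every value of x
setkey(big, x)                     # one-off sort that establishes the key
system.time(big["id00001"])        # binary search on the sorted key column

As I understand it, the first call has to touch every value of x, while the keyed lookup needs only a handful of comparisons once the sort has been paid for.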
When aggregating over a subset of the data using by, what is the fastest way to define the key? Should I key the column that I am using to subset the data, or the column that defines the groups? For example:
setkey(DT,x)
DT[!"bd",sum(v),by=y]
or
setkey(DT,y)
DT[!"bd",sum(v),by=y]
Is there a way to utilize a key for both x and y?
EDIT
Does setting the key to both x and y perform two vector searches? i.e.:
setkey(DT,x,y)
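One way to check is data.table's verbose argument, which prints diagnostics about how i is matched and how the groups are formed (a sketch, reusing the table keyed above):

key(DT)                                     # "x" "y"
DT[!"bd", sum(v), by = y, verbose = TRUE]   # printed diagnostics describe the subset and the grouping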
EDIT2
Sorry, what I meant to ask was: will the call DT[!"bd",sum(v),by=y] perform two binary scans when DT is keyed by both x and y?
setkey sorts all the columns (going last to first), and so will perform at least nrows * num_keycols operations. – Nevus
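To make that sorting concrete, here is a tiny sketch with an invented table: setkey physically reorders the rows by the key columns.

small = data.table(x = c("b", "a", "b", "a"), y = c(2, 1, 1, 2), v = 1:4)
setkey(small, x, y)
small        # rows now stored in order a/1, a/2, b/1, b/2
key(small)   # "x" "y"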
I am still getting familiar with the data.table package. I have read the 10 min intro, the FAQ, and most of the documentation. I've also been browsing a lot of SO questions, but I am still feeling iffy overall. – Emerick
If the key is set to x,y, it will just do one binary search for the !"bd" part. There is no binary search involved in the by expression afaik (there is a sort there that is sometimes avoided, e.g. if you did by=x, but not in this case). – Nevus
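To compare the two keying strategies empirically, one could time the same query under each key on a larger table (a sketch; the sizes and values are invented, and system.time is only a crude timer):

set.seed(1)
big = data.table(x = sample(c("ad", "bd", "cd"), 1e7, replace = TRUE),
                 y = sample(1:100, 1e7, replace = TRUE),
                 v = runif(1e7))

setkey(big, x)                            # key only the column used for subsetting
system.time(big[!"bd", sum(v), by = y])

setkey(big, x, y)                         # key both the subsetting and the grouping column
system.time(big[!"bd", sum(v), by = y])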