How to define data.table keys for fastest aggregation using multiple keys
Asked Answered
E

1

8

I am trying to better understand utilizing keyd data.tables. After reading the documentation I think I understand how to speed up subsetting when using one key. For example:

DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=c(1,3,6), v=1:9)

Option one:

DT[x == "ad"]

Option two:

setkey(DT,x)
DT["ad"]

In this case option one is much slower than option two, because the data.table uses the key to seach more efficiently (using a binary search vs. a vector scan, which I do not understand but I will trust is faster.)

In the case of aggregating on a subset of the data using a by statement, what is the fastest way to define the key? Should I key the column that I am using to subset the data, or the column that defines the groups? For example:

setkey(DT,x)
DT[!"bd",sum(v),by=y]

or

setkey(DT,y)
DT[!"bd",sum(v),by=y]

Is there a way to utilize a key for both x and y?

EDIT

Does setting the key to both x and y perform two vector searches? i.e:

setkey(DT,x,y)

EDIT2

Sorry, what I meant to ask was will the call DT[!"bd",sum(v),by=y] perform two binary scans when DT is keyed by both x and y?

Emerick answered 14/11, 2013 at 19:21 Comment(5)
perhaps you're looking for this: https://mcmap.net/q/677835/-subsetting-data-table-by-2nd-column-only-of-a-2-column-key-using-binary-search-not-vector-scan - a cautionary note - option one is faster than option two, i.e. if all you're doing is a single look-up and don't have a key set yet, a simple vector scan is going to be fasterNevus
@Nevus So a vector scan is faster than setting the key then doing the search? In other words I should not set a new key before I do every search on a new column? And I saw the answer you referenced, but I was curious about using one key for the i argument and a different key for the by assignment.Emerick
to answer your edit - setkey sorts all the columns (going last to first), and so will at least perform nrows * num_keycols operationsNevus
@Nevus Sorry for all the confusion. I am still getting used to the new vocabulary and am still struggling to understand how to most efficiently use the data.table package. I have read the 10 min intro, FAQ, and most of the documentation. I've also been browsing a lot of SO questions, but I am still feeling iffy overall.Emerick
np; re edit2: if the key is x,y it will just do one binary search for the !"bd" part. There is no binary search involved in the by expression afaik (there is a sort there that is sometimes avoided, e.g. if you did by=x, but not in this case).Nevus
V
1

I believe it is not possible to perform two binary scans when the data table DT is keyed by both x and y. Instead I would repeat the keying first on x and then on y as follows:

DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=as.character(c(1,3,4)), v=1:9)
setkey(DT,x)
tmp = DT[!"bd"]
setkey(tmp,y)
tmp[!"1",sum(v),by=y]
Vitellin answered 25/8, 2014 at 16:59 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.