rolling joins data.table in R
Asked Answered
H

1

47

I am trying to understand a little more about the way rolling joins work and am having some confusion, I was hoping somebody could clarify this for me. To take a concrete example:

dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")

I expected this to produce a long data.table where the values in dt2 are rolled:

dt1[dt2,roll=TRUE]

Instead, the correct way to do this seems to be:

dt2[dt1,roll=TRUE]

Could someone explain to me more about how joining in data.table works as I am clearly not understanding it correctly. I thought that dt1[dt2,roll=TRUE] corresponded to the sql equivalent of select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), except with the added functionality locf.

Additionally the documentation says:

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) 
as an index.

This makes it seem that only things in X should be returned an the join being done is an inner join, not outer. What about in the case when roll=T but that particular id does not exist in dt1? Playing around a bit more I can't understand what value is being placed into the column.

Harsho answered 19/8, 2012 at 23:59 Comment(1)
I've written a blog post describing how rolling joins work here gormanalysis.com/r-data-table-rolling-joinsHendley
B
29

That quote from the documentation appears to be from FAQ 1.12 What is the difference between X[Y] and merge(X,Y). Did you find the following in ?data.table and does it help?

roll Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.

rolltolast Like roll but the data is not rolled forward past the last observation within each group defined by the join columns. The value of i must fall in a gap in x but not after the end of the data, for that group defined by all but the last join column. roll and rolltolast may not both be TRUE.

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14 Can you explain further why data.table is inspired by A[B] syntax in base. That's quite a long answer so I won't paste it here.

Burnsed answered 20/8, 2012 at 14:36 Comment(7)
i did not see that, thank you for posting. what confuses me is that x[y] returns rows that are in y but not in x? this doesn't seem like an equi-join... in particular x[y] and merge(x, y, all.y=TRUE) seem to be the same no?Harsho
Oh I see what you mean. nomatch is by default NA, so that's an outer join by default. Set 'nomatch' to 0 for inner join. The term equi-join is meant to convey that = occurs in the (analogous) where clause, as opposed to roll which is analogous to <=. Equi join isn't analogous to innner join.Burnsed
Also, to confuse you further, seach ?data.table for by without by. Since mult is "all" by default, what's really happening in X[Y] is grouping, and it's putting the join columns from Y into the result (which is why you think it's returning rows from Y). Once you understand that, move on to join inherited scope.Burnsed
Have you seen and followed the tip at the top of ?data.table to run example(data.table) and work through the examples at the prompt?Burnsed
i have, yes, but i was still baffled unfortunately. i started coming up with my own examples to try to understand the logic but there seemed to be contradictory things going on so i couldn't figure out what was happening. this makes it a bit clearer although i might have more questions down the line.Harsho
rolltolast is being replaced by rollends in v1.9.6. : ` 'rolltolast' has been marked 'deprecated' in ?data.table since v1.8.8 on CRAN 3 Mar 2013, see NEWS. Please change to the more flexible 'rollends' instead. 'rolltolast' will be removed in the next version.`Contuse
@Contuse +1 thanks. Any chance you could edit the answer directly would be faster.Burnsed

© 2022 - 2024 — McMap. All rights reserved.