While preparing an answer to the question dplyr or data.table to calculate time series aggregations in R I noticed that I do get different results depending on whether the table is updated in place or returned as a new object. Also, I do get different result when I change the order of columns in the non-equi join conditions.
Currently, I don't have an explanation for this, perhaps due to a major misunderstanding on my side or a simple coding error.
Please, note that this question is asking particularly for explanations of the observed behaviour of
data.table
joins. If you have alternative solutions to the underlying problem, please, feel free to post an answer to the original question.
Original question and working answer
The original question was how to count the number of hospitalizations occuring in the 365 days before that hospitalization (including the actual one) for each patient using these data:
library(data.table) # version 1.10.4 (CRAN) or 1.10.5 (devel built 2017-08-19)
DT0 <- data.table(
patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7",
"2015/12/20", "2015/12/25", "2016/2/10")))
setorder(DT0, patient.id, hospitalization.date)
DT0
patient.id hospitalization.date 1: 1 2013-10-15 2: 1 2015-07-16 3: 1 2016-01-07 4: 2 2014-10-15 5: 2 2015-12-20 6: 2 2015-12-25 7: 2 2016-02-10
The code below gives the expected answer (additional helper column added here for clarity)
# add helper columns
DT0[, start.date := hospitalization.date - 365][
, end.date := hospitalization.date][]
DT0
patient.id hospitalization.date start.date end.date 1: 1 2013-10-15 2012-10-15 2013-10-15 2: 1 2015-07-16 2014-07-16 2015-07-16 3: 1 2016-01-07 2015-01-07 2016-01-07 4: 2 2014-10-15 2013-10-15 2014-10-15 5: 2 2015-12-20 2014-12-20 2015-12-20 6: 2 2015-12-25 2014-12-25 2015-12-25 7: 2 2016-02-10 2015-02-10 2016-02-10
result <- DT0[DT0, on = c("patient.id", "hospitalization.date>=start.date",
"hospitalization.date<=end.date"),
.(hospitalizations.last.year = .N), by = .EACHI][]
result
patient.id hospitalization.date hospitalization.date hospitalizations.last.year 1: 1 2012-10-15 2013-10-15 1 2: 1 2014-07-16 2015-07-16 1 3: 1 2015-01-07 2016-01-07 2 4: 2 2013-10-15 2014-10-15 1 5: 2 2014-12-20 2015-12-20 1 6: 2 2014-12-25 2015-12-25 2 7: 2 2015-02-10 2016-02-10 3
except for the renamed and duplicated column names (which are left as is for comparison).
For patient.id == 2
, the result in last row is 3 because the patient was hospitalized on 2016-02-10 for the third time since 2015-02-10.
Update on join in place
result
is a new data.table
object which occupies additional memory. I tried to update the original data.table
object in place using:
# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "hospitalization.date>=start.date",
"hospitalization.date<=end.date"),
hospitalizations.last.year := .N, by = .EACHI]
DT
patient.id hospitalization.date start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2012-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2 3: 1 2016-01-07 2015-01-07 2016-01-07 2 4: 2 2014-10-15 2013-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 3 6: 2 2015-12-25 2014-12-25 2015-12-25 3 7: 2 2016-02-10 2015-02-10 2016-02-10 3
DT
has now been updated in place but rows 5 and 6 show now 3 hospitalizations instead of 1 or 2, resp. It seems that now the total number of hospitalizations within the last period is returned for each of the rows.
Change the order of columns in the conditions.
Also the order of columns in the non-equi join conditions do matter, even in a self-join:
result <- DT0[DT0, on = c("patient.id", "start.date<=hospitalization.date",
"end.date>=hospitalization.date"),
.(hospitalizations.last.year = .N), by = .EACHI][]
result
My expectation was that "start.date<=hospitalization.date"
would be equivalent to "hospitalization.date>=start.date"
(note that also <
and >
were switched) but the result
patient.id start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2013-10-15 1 2: 1 2015-07-16 2015-07-16 2 3: 1 2016-01-07 2016-01-07 1 4: 2 2014-10-15 2014-10-15 1 5: 2 2015-12-20 2015-12-20 3 6: 2 2015-12-25 2015-12-25 2 7: 2 2016-02-10 2016-02-10 1
is different. It seems that now the number of forthcoming hospitalizations are being counted
Interestingly, the update in place does now return the same result (except for some of the column names):
# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "start.date<=hospitalization.date",
"end.date>=hospitalization.date"),
hospitalizations.last.year := .N, by = .EACHI]
DT
patient.id hospitalization.date start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2012-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2 3: 1 2016-01-07 2015-01-07 2016-01-07 1 4: 2 2014-10-15 2013-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 3 6: 2 2015-12-25 2014-12-25 2015-12-25 2 7: 2 2016-02-10 2015-02-10 2016-02-10 1
Related
There is a potentially related question which led to an issue reported on github.
There is an answer by Arun concerning usage of the x.
prefix with non-equi joins.
.GRP
beforex.h.date
inj
, data.table can't findx.h.date
... will report. – Shaper