I was wondering whether someone knows if the dplyr
extension packages (dbplyr
and dtplyr
) allow non-equi joins within the usual dplyr workflow? I rarely need data.table
, but fast non-equi joins are the only moments where I always need to setDT
, then perform the join and then switch back with as_tibble()
. I scrolled through the issues in the packages on github but didn't find whether this is something that's planned or already implemented.
Non-equi joins are available since 1.1.0
via the function join_by
. To create non-equi joins, you can use <
, >
, >=
, <=
, or helpers between
, within
, overlaps
and nearest
.
library(dplyr)
#Example from https://github.com/tidyverse/dplyr/pull/5910.
set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))
set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))
needles
#> # A tibble: 5 x 2
#> dates x
#> <date> <int>
#> 1 2019-01-01 3
#> 2 2019-01-02 2
#> 3 2019-01-03 5
#> 4 2019-01-04 4
#> 5 2019-01-05 1
haystack
#> # A tibble: 5 x 3
#> lower upper y
#> <date> <date> <int>
#> 1 2019-01-04 2019-01-06 1
#> 2 2019-01-07 2019-01-08 2
#> 3 2019-01-04 2019-01-05 3
#> 4 2019-01-03 2019-01-05 4
#> 5 2019-01-03 2019-01-05 5
# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#> dates x lower upper y
#> <date> <int> <date> <date> <int>
#> 1 2019-01-01 3 NA NA NA
#> 2 2019-01-02 2 NA NA NA
#> 3 2019-01-03 5 2019-01-03 2019-01-05 4
#> 4 2019-01-03 5 2019-01-03 2019-01-05 5
#> 5 2019-01-04 4 2019-01-04 2019-01-06 1
#> 6 2019-01-04 4 2019-01-04 2019-01-05 3
#> 7 2019-01-04 4 2019-01-03 2019-01-05 4
#> 8 2019-01-04 4 2019-01-03 2019-01-05 5
#> 9 2019-01-05 1 2019-01-04 2019-01-06 1
#> 10 2019-01-05 1 2019-01-04 2019-01-05 3
#> 11 2019-01-05 1 2019-01-03 2019-01-05 4
#> 12 2019-01-05 1 2019-01-03 2019-01-05 5
There has been a new option for this in dbplyr
since the version 1.4.0: sql_on
. Citing Kirill Müller:
There is #2240 for dplyr, but it's going to take a while. For databases we already have a workaround [ie. generic SQL joins].
library(dplyr)
library(dbplyr)
tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c")
For dbplyr: While SQL supports non-equi joins, I have not found a dplyr approach that is equivalent. My usual work around is very similar to the r-bloggers link posted by @Waldi to join on the equality conditions and then filter on the inequality conditions.
For example:
output = join(df1, df2, by = c("df1_id" = "df2_id")) %>%
filter(df1_date <= df2_date)
This translates to SQL similar to:
SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
WHERE df1_date <= df2_date
Which is not that different from:
SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
AND df1_date <= df2_date
df1_date
was NULL
- in an inequality join this record would be part of the output with nothing joined to it, but using the WHERE clause this record might be dropped. However, in these kinds of cases you probably want to handle NULL
values explicitly. –
Houri © 2022 - 2024 — McMap. All rights reserved.
fuzzyjoin
package. – Overpay