Non-equi join in tidyverse

Asked 25/2, 2021 at 16:39 Answered 25/8, 2022 at 13:7

I was wondering whether someone knows if the dplyr extension packages (dbplyr and dtplyr) allow non-equi joins within the usual dplyr workflow? I rarely need data.table, but fast non-equi joins are the only moments where I always need to setDT, then perform the join and then switch back with as_tibble(). I scrolled through the issues in the packages on github but didn't find whether this is something that's planned or already implemented.

Gene answered 25/2, 2021 at 16:39 Comment(4)

Interesting, but not yet pushed forward : github.com/tidyverse/dtplyr/issues/133 – Erythrocytometer 25/2, 2021 at 17:40

r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins – Erythrocytometer 25/2, 2021 at 17:48

You might consider the fuzzyjoin package. – Overpay 25/2, 2021 at 19:15

Inner join on means cross join where. Cross join means inner join on true. – Latakia 1/3, 2021 at 2:54

Non-equi joins are available since 1.1.0 via the function join_by. To create non-equi joins, you can use <, >, >=, <=, or helpers between, within, overlaps and nearest.

library(dplyr)

#Example from https://github.com/tidyverse/dplyr/pull/5910.

set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))

set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))

needles
#> # A tibble: 5 x 2
#>   dates          x
#>   <date>     <int>
#> 1 2019-01-01     3
#> 2 2019-01-02     2
#> 3 2019-01-03     5
#> 4 2019-01-04     4
#> 5 2019-01-05     1
haystack
#> # A tibble: 5 x 3
#>   lower      upper          y
#>   <date>     <date>     <int>
#> 1 2019-01-04 2019-01-06     1
#> 2 2019-01-07 2019-01-08     2
#> 3 2019-01-04 2019-01-05     3
#> 4 2019-01-03 2019-01-05     4
#> 5 2019-01-03 2019-01-05     5

# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#>    dates          x lower      upper          y
#>    <date>     <int> <date>     <date>     <int>
#>  1 2019-01-01     3 NA         NA            NA
#>  2 2019-01-02     2 NA         NA            NA
#>  3 2019-01-03     5 2019-01-03 2019-01-05     4
#>  4 2019-01-03     5 2019-01-03 2019-01-05     5
#>  5 2019-01-04     4 2019-01-04 2019-01-06     1
#>  6 2019-01-04     4 2019-01-04 2019-01-05     3
#>  7 2019-01-04     4 2019-01-03 2019-01-05     4
#>  8 2019-01-04     4 2019-01-03 2019-01-05     5
#>  9 2019-01-05     1 2019-01-04 2019-01-06     1
#> 10 2019-01-05     1 2019-01-04 2019-01-05     3
#> 11 2019-01-05     1 2019-01-03 2019-01-05     4
#> 12 2019-01-05     1 2019-01-03 2019-01-05     5

Everyone answered 25/8, 2022 at 13:7 Comment(0)

There has been a new option for this in dbplyr since the version 1.4.0: sql_on. Citing Kirill Müller:

There is #2240 for dplyr, but it's going to take a while. For databases we already have a workaround [ie. generic SQL joins].

library(dplyr)
library(dbplyr)
tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c")

Kibler answered 26/11, 2021 at 15:47 Comment(0)

For dbplyr: While SQL supports non-equi joins, I have not found a dplyr approach that is equivalent. My usual work around is very similar to the r-bloggers link posted by @Waldi to join on the equality conditions and then filter on the inequality conditions.

For example:

output = join(df1, df2, by = c("df1_id" = "df2_id")) %>%
  filter(df1_date <= df2_date)

This translates to SQL similar to:

SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
WHERE df1_date <= df2_date

Which is not that different from:

SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
AND df1_date <= df2_date

Houri answered 25/2, 2021 at 21:37 Comment(2)

"Which is not that different from:" Thanks for the answer. Are there cases when the two differ? – Hyderabad 23/2, 2022 at 9:50

There may be some cases when using a left, right, or outer join that the two return different output. Suppose we used a left join and there was a record where df1_date was NULL - in an inequality join this record would be part of the output with nothing joined to it, but using the WHERE clause this record might be dropped. However, in these kinds of cases you probably want to handle NULL values explicitly. – Houri 23/2, 2022 at 19:54

Recommended topics

Hot tags