left_join
works as expected with NA values on tibbles or data frames, but on tbl it seems it does not match NAs, even with the option na_matches = "na".
R version and package versions
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
...
other attached packages:
[1] reprex_0.3.0 dbplyr_1.4.2 lubridate_1.7.4 magrittr_1.5 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.1 purrr_0.3.2 readr_1.3.1
[10] tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0 tidyverse_1.2.1
...
The following is a reprex with SQLite, but the same happens with PostgreSQL (I actually stumbled upon that problem with a PostgreSQL DB).
Minimal reprex.
(1) I create 2 data frames, copy them locally to SQLite DB, then load them again as tbl.
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")
df_1
#> # A tibble: 2 x 3
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
df_2
#> # A tibble: 2 x 3
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
dt_1
#> # Source: table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
dt_2
#> # Source: table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
(2) Then I use left_join
, first on the data frames, then on the tbls:
left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> cc
left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
We can see that the second row last column C
has the expected cc
in the case of data frames (by default na_matches = "na"
) but <NA>
in the case of tbl even with the explicit option na_matches = "na"
(which is the default according to the doc). This is unexpected.
EDIT
Notice that this is the same result as in the case of data frames with na_matches = "never"
:
left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
BTW, the title mentions left_join
because it's the most common join, but same problem arises with inner_join
(full_join
is not yet out for data tables), perhaps even more visibly if we keep the na_matches = "na"
in both:
inner_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
inner_join(df_1, df_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> cc
left_join(dt_1 %>% mutate(D = COALESCE(D, "z")), dt_2 %>% mutate(D = COALESCE(D, "z")))
works the same for data table and data frame. One can also usedt_1 %>% mutate(D = ifelse(is.na(D), "z", D))
etc. instead ofCOALESCE
(ormutate_at
ormutate_all
if appropriate). Thanks! – Julijuliax=y
means x equals y and x is not null & y is not null. For true equality in SQL (with NULL equals NULL) the straightforward expression isx=y or (x is null & y is null)
so in Rx=y or (is.na(x) & is.na(y))
. PS Yes "left_join works on data tables the same way as on data frames"--the dplyr code says left_join.data.table(x, ...) is as.data.table(left_join.tbl_df(tbl_df(x),...))). But how to explain the row with 2 <NA>s where na_match is not used? Do you have a trace? Do I misunderstand tibbles vs tbl_df vs df? – Attenuationx = y
. Also, not sure what you mean by "Do you have a trace?". – Julijuliaclass(df_1)
returns"tbl_df" "tbl" "data.frame"
andclass(dt_1)
returns"tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
. Anyway, I showed thatleft_join
does not work the same way with "data frames" as with "data tables" (as you noticed, removing thena_matches
option yields the same output: 2<NA>
s instead of 1). But I thought your NEWS link addressed that. – Julijuliana_matches = "na"
"is only supported for data frames". Not sure what "data frame" exactly means here. If it is just classdata.frame
, possibly withtbl_df
, but excludingtbl
, then it still seems relevant. However, I still think it should be clearly specified in the doc. – Julijuliana_matches
behavior is unexpectedly different for data frames than for data tables. I am ok with that, at least temporarily, since porting the beautiful dplyr package to data tables is still in the making. But it should be clearly stated in the join verbs documentation to avoid this kind of confusion. – Julijulia