left_join for tbl: na_matches not working
Asked Answered
J

1

6

left_join works as expected with NA values on tibbles or data frames, but on tbl it seems it does not match NAs, even with the option na_matches = "na".

R version and package versions

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
...
other attached packages:
 [1] reprex_0.3.0    dbplyr_1.4.2    lubridate_1.7.4 magrittr_1.5    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1     purrr_0.3.2     readr_1.3.1
[10] tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0   tidyverse_1.2.1
...

The following is a reprex with SQLite, but the same happens with PostgreSQL (I actually stumbled upon that problem with a PostgreSQL DB).

Minimal reprex.

(1) I create 2 data frames, copy them locally to SQLite DB, then load them again as tbl.

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")

df_1
#> # A tibble: 2 x 3
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

df_2
#> # A tibble: 2 x 3
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

dt_1
#> # Source:   table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

dt_2
#> # Source:   table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

(2) Then I use left_join, first on the data frames, then on the tbls:

left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc

left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

We can see that the second row last column C has the expected cc in the case of data frames (by default na_matches = "na") but <NA> in the case of tbl even with the explicit option na_matches = "na" (which is the default according to the doc). This is unexpected.

EDIT

Notice that this is the same result as in the case of data frames with na_matches = "never":


left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

BTW, the title mentions left_join because it's the most common join, but same problem arises with inner_join (full_join is not yet out for data tables), perhaps even more visibly if we keep the na_matches = "na" in both:

inner_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c
inner_join(df_1, df_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc
Julijulia answered 31/8, 2019 at 2:52 Comment(16)
Read the documentation. na_match doesn't apply to data tables.Attenuation
@Attenuation can you be more specific? I'm upvoting this just because OP took the what seems an extra mile to show the problem at hand.Denudation
@RomanLuštrik From github.com/tidyverse/dplyr/blob/master/NEWS.md#joins: dplyr 0.7.0 Verbs Joins "[API] xxx_join.tbl_df(na_matches = "never") treats all NA values as different from each other (and from any other value), so that they never match. This corresponds to the behavior of joins for database sources, and of database joins in general. To match NA values, pass na_matches = "na" to the join verbs; this is only supported for data frames." There's no mention of na_match in data.table documentation. However I don't know how clear the situation is--Attenuation
@RomanLuštrik --data.table documentation says that tables are frames & can be passed to frame operators. But that may mean, per typical OO semantics, it can they can be operands in calls where the operator name is from a superclass but where the actual operator called is subclass-specific. The data.frame documentaiton quote last comment is as cryptic. I don't know enough about the packages to give a definitive answer. I don't mean to criticize the asker, just inform them. (But they in turn ought to have & give a justification for their expectations.)Attenuation
I would never have thought of checking the NEWS file, thanks. :DDenudation
This answer provides a possible work around: #55453560Havard
@Attenuation Thanks for the NEWS link. My expectation might have been a little naive: That left_join works on data tables the same way as on data frames. I understand that SQL is not as flexible as dplyr and that matching NAs might not be as easy in SQL but it would be nice if such differences in behaviour were specified in the documentation.Julijulia
@RomanLuštrik Thanks for the upvote!Julijulia
@Havard That trick seems to work: left_join(dt_1 %>% mutate(D = COALESCE(D, "z")), dt_2 %>% mutate(D = COALESCE(D, "z"))) works the same for data table and data frame. One can also use dt_1 %>% mutate(D = ifelse(is.na(D), "z", D)) etc. instead of COALESCE (or mutate_at or mutate_all if appropriate). Thanks!Julijulia
(@Havard &) Habert: SQL x=y means x equals y and x is not null & y is not null. For true equality in SQL (with NULL equals NULL) the straightforward expression is x=y or (x is null & y is null) so in R x=y or (is.na(x) & is.na(y)). PS Yes "left_join works on data tables the same way as on data frames"--the dplyr code says left_join.data.table(x, ...) is as.data.table(left_join.tbl_df(tbl_df(x),...))). But how to explain the row with 2 <NA>s where na_match is not used? Do you have a trace? Do I misunderstand tibbles vs tbl_df vs df?Attenuation
@Attenuation Thanks for the clarification with the SQL meaning of x = y. Also, not sure what you mean by "Do you have a trace?".Julijulia
@Attenuation I used the term "data frame" for the df_1 and df_2 above (tibble is just a more modern version of data frame) and the term "data table" for dt_1 and dt_2 above. This is an over-simplification since class(df_1) returns "tbl_df" "tbl" "data.frame" and class(dt_1) returns "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl". Anyway, I showed that left_join does not work the same way with "data frames" as with "data tables" (as you noticed, removing the na_matches option yields the same output: 2 <NA>s instead of 1). But I thought your NEWS link addressed that.Julijulia
Re "trace": I wondered if you had an execution stack trace to see what R was actually calling. I see now the dt_N are neither dfs nor dts--does that not make the NEWS & na_match irrelevant?--but rather are of classes with SQL semantics. I guess the C _left_join eventually called via the left_joins knows about those connection table classes.Attenuation
@Attenuation the NEWS link says na_matches = "na" "is only supported for data frames". Not sure what "data frame" exactly means here. If it is just class data.frame, possibly with tbl_df, but excluding tbl, then it still seems relevant. However, I still think it should be clearly specified in the doc.Julijulia
Thanks. I couldn't find any left_joins but .data.frame & .tbl_df. I can't get at your "doc" link. Wait--your "definition" link is to dbplyr--but it's not explicit in your code. I see I used library(tidyverse) at rdrr.io/snippets with tidyverse 1.2.1 & no dbplyr & apparently just left_join for dt & df in scope plus C++ version. Didn't get any errors on your initialization code & got the same 2 left_join results. Thus I wondered/wonder why SQL semantics for me. debug(left_join) does give a call left_join.tbl_lazy though. PS Why not post an answer incorporating all these comments?Attenuation
@phlipxy Thanks for trying my code snippets. I erased my 2 previous "debug comments" and re-wrote their content a little more intelligibly in the answer post below as you suggested. In conclusion the na_matches behavior is unexpectedly different for data frames than for data tables. I am ok with that, at least temporarily, since porting the beautiful dplyr package to data tables is still in the making. But it should be clearly stated in the join verbs documentation to avoid this kind of confusion.Julijulia
J
4

To respond to @philipxy ’s request to dig a little further in the left_join process I went on debug mode for left_join, first on data tables:

debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#>  debugging in: left_join(dt_1, dt_2, na_matches = "na")
#>  debug: {
#>      UseMethod("left_join")
#>  }
Browse[2]>  n
#>  debug: UseMethod("left_join")
#>  Browse[2]> n
#>  debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#>  debug: {
#>      add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>          suffix = suffix, auto_index = auto_index, ...)
#>  }
Browse[3]>
#>  debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#>  debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
#>  debug: {
#>      if (!is.null(sql_on)) {
#>         by <- list(x = character(0), y = character(0), on = sql(sql_on))
#>      }
#>      else if (identical(type, "full") && identical(by, character())) {
#>          type <- "cross"
#>          by <- list(x = character(0), y = character(0))
#>      }
#>      else {
#>          by <- common_by(by, x, y)
#>      }
#>      y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#>          list(by$y))
#>      vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#>          suffix = suffix)
#>      x$ops <- op_double("join", x, y, args = list(vars = vars,
#>          type = type, by = by, suffix = suffix))
#>      x
#>  }
Browse[4]> f
#>  Joining, by = c("A", "D")
#>  exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
#>  exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#>  exiting from: left_join(dt_1, dt_2, na_matches = "na")
#>  # Source:   lazy query [?? x 4]
#>  # Database: sqlite 3.29.0 [:memory:]
#>    A     B     D     C
#>    <chr> <chr> <chr> <chr>
#>  1 a     b     d     c
#>  2 aa    bb    NA    NA

We see that left_join calls left_join.tbl_lazy on data tables with the na_matches = “na” option. However this is followed by a call to add_op_join the definition of which does not have any mention of na_matches.

Then, by contrast, on data frames:

left_join(df_1, df_2)
#>  debugging in: left_join(df_1, df_2)
#>  debug: {
#>      UseMethod("left_join")
#>  }
Browse[2]> n
#>  debug: UseMethod("left_join")
Browse[2]>
#>  debugging in: left_join.tbl_df(df_1, df_2)
#>  debug: {
#>      check_valid_names(tbl_vars(x))
#>      check_valid_names(tbl_vars(y))
#>      by <- common_by(by, x, y)
#>      suffix <- check_suffix(suffix)
#>      na_matches <- check_na_matches(na_matches)
#>      y <- auto_copy(x, y, copy = copy)
#>      vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#>      by_x <- vars$idx$x$by
#>      by_y <- vars$idx$y$by
#>      aux_x <- vars$idx$x$aux
#>      aux_y <- vars$idx$y$aux
#>      out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#>          environment())
#>      names(out) <- vars$alias
#>      reconstruct_join(out, x, vars)
#>  }
Browse[3]>
#>  debug: check_valid_names(tbl_vars(x))
Browse[3]>
#>  debug: check_valid_names(tbl_vars(y))
Browse[3]>
#>  debug: by <- common_by(by, x, y)
Browse[3]>
#>  Joining, by = c("A", "D")
#>  debug: suffix <- check_suffix(suffix)
Browse[3]>
#>  debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#>  debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#>  [1] TRUE
Browse[3]> f
#>  exiting from: left_join.tbl_df(df_1, df_2)
#>  exiting from: left_join(df_1, df_2)
#>  # A tibble: 2 x 4
#>    A     B     D     C
#>    <chr> <chr> <chr> <chr>
#>  1 a     b     d     c
#>  2 aa    bb    NA    cc

We see that left_join calls left_join.tbl_df on data frames. Further down we see that na_matches is set to TRUE before being used as argument in left_join_impl. All this makes sense.

When typing ?left_join.tbl_lazy the doc returns a local page for join.tbl_sql {dbplyr} which states for the unspecified arguments (the ):

“Other parameters passed onto methods, for instance, na_matches to control how NA values are matched. See join.tbl_df for more”.

Following the join.tbl_df doc link, it clearly mentions na_matches:

"Use 'never' to always treat two NA or NaN values as different, like joins for database sources, similarly to merge(incomparables = FALSE). The default, 'na', always treats two NA or NaN values as equal, like merge(). Users and package authors can change the default behavior by calling pkgconfig::set_config('dplyr::na_matches' = 'never')".

So there seems to be some inconsistency between the doc and the code for data tables.

Also, @philipxy mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames". Now the class of dt_1 and df_1 are:

class(df_1)
#>  [1] "tbl_df"     "tbl"        "data.frame"
class(dt_1)
#>  [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"
#>  [4] "tbl_lazy"             "tbl"

I suppose the term "data frame" refers to classes data.frame and tbl_df, and what I called "data table" are the other tbl_*s including tbl_sql and tbl_lazy. So this news link also answers the question.

Still, I think the current documentation for the join verbs is confusing. It should clearly state:

"The defaults is na_matches = 'na' for data frames and na_matches = 'never' (with no other choice) for data tables".

Hopefully, the choice na_matches = "na" will be implemented not too long in the future for data tables.

Julijulia answered 5/9, 2019 at 0:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.