Inconsistency of na.action between xtabs and aggregate in R
Asked Answered
B

2

5

I have the following data.frame:

x <- data.frame(A = c("Y", "Y", "Z", NA),
                B = c(NA, TRUE, FALSE, TRUE),
                C = c(TRUE, TRUE, NA, FALSE))

And I need to compute the following table with xtabs:

A      B C
  Y    1 2
  Z    0 0
  <NA> 1 0

I was told to use na.action = NULL, which indeed returns the table I need:

xtabs(formula = cbind(B, C) ~ A,
      data = x,
      addNA = TRUE,
      na.action = NULL)

A      B C
  Y    1 2
  Z    0 0
  <NA> 1 0

However, na.action = na.pass returns a different table:

xtabs(formula = cbind(B, C) ~ A,
      data = x,
      addNA = TRUE,
      na.action = na.pass)

A       B  C
  Y        2
  Z     0   
  <NA>  1  0

But the documentation of xtabs says:

na.action
When it is na.pass and formula has a left hand side (with counts), sum(*, na.rm = TRUE) is used instead of sum(*) for the counts.

With aggregate, na.action = na.pass returns the expected result (and also na.action = NULL):

aggregate(formula = cbind(B, C) ~ addNA(A),
          data = x,
          FUN = sum,
          na.rm = TRUE,
          na.action = na.pass) # same result with na.action = NULL

  addNA(A) B C
1            Y 1 2
2            Z 0 0
3         <NA> 1 0

Although I get the table I need with xtabs, I do not understand the behavior of na.action in xtabs from the documentation. So my questions are:

  • Is the behavior of na.action in xtabs consistent with the documentation? Unless I am missing something, na.action = na.pass does not result in sum(*, na.rm = TRUE).
  • Is na.action = NULL documented somewhere?
  • In xtabs source code there is na.rm <- identical(naAct, quote(na.omit)) || identical(naAct, na.omit) || identical(naAct, "na.omit"). But I saw nothing for na.action = na.pass and na.action = NULL. How do na.action = na.pass and na.action = NULL work?
Baylor answered 15/4, 2020 at 23:6 Comment(2)
you can check a function which indicates what should happen when the data contain NAs. If unspecified, and addNA is true, this is set to na.pass. When it is na.pass and formula has a left hand side (with counts), sum(*, na.rm = TRUE) is used instead of sum(*) for the counts.Adulate
Thank you, but that is exactly what I do not understand. The documentation do not seem to be consistent with the behavior, or at least I fail to understand the documentation. I edited my question to clarify.Baylor
D
6

It's difficult to give a cannonical answer without describing how xtabs works. If we step through the main points of its source code, we'll see clearly what's going on.

After some basic type checking, the call to xtabs works internally by first creating a data frame of all the variables contained in your formula using stats::model.frame, and it is to this that the na.action parameter is passed.

The way it does this is quite clever. xtabs first copies the call you made to it via match.call, like this:

m <- match.call(expand.dots = FALSE)

Then it strips out the parameters that don't need passed to stats::model.frame like this:

m$... <- m$exclude <- m$drop.unused.levels <- m$sparse <- m$addNA <- NULL

As promised in the help file, if addNA is TRUE and na.action is missing, it will now default to na.pass:

    if (addNA && missing(na.action)) 
        m$na.action <- quote(na.pass)

Then it changes the function to be called from xtabs to stats::model.frame like this:

m[[1L]] <- quote(stats::model.frame)

So the object m is a call (and is also a standalone reprex), which in your case looks like this:

stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L, 
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE), 
C = c(TRUE, TRUE, NA, FALSE)), na.action = NULL)

Note that your na.action = NULL has been passed to this call. This has the effect of keeping all NA values in the frame. When the above call is evaluated, it gives this data frame:

eval(m)
#>   cbind(B, C).B cbind(B, C).C    A
#> 1            NA          TRUE    Y
#> 2          TRUE          TRUE    Y
#> 3         FALSE            NA    Z
#> 4          TRUE         FALSE <NA>

Note that this is the same result you would get if you passed na.action = na.pass:

stats::model.frame(formula = cbind(B, C) ~ A, data = list(A = structure(c(1L, 
1L, 2L, NA), .Label = c("Y", "Z"), class = "factor"), B = c(NA, TRUE, FALSE, TRUE), 
C = c(TRUE, TRUE, NA, FALSE)), na.action = na.pass)
#>   cbind(B, C).B cbind(B, C).C    A
#> 1            NA          TRUE    Y
#> 2          TRUE          TRUE    Y
#> 3         FALSE            NA    Z
#> 4          TRUE         FALSE <NA>

However, if you passed na.action = na.omit, you would only be left with a single row, since only row 2 has no NA values.

In any case, the "model frame" result is stored in the variable mf. This is then split into the independent variable(s), - in your case, column A, and the response variable - in your case cbind(B, C).

The response is stored in y and the variable in by:

        i <- attr(attr(mf, "terms"), "response")
        by <- mf[-i]
        y <- mf[[i]]

Now, by is processed to ensure each independent variable is a factor, and that any NA values are converted into factor levels if you have specified addNA = TRUE:

    by <- lapply(by, function(u) {
        if (!is.factor(u)) 
            u <- factor(u, exclude = exclude)
        else if (has.exclude) 
            u <- factor(as.character(u), levels = setdiff(levels(u), 
                exclude), exclude = NULL)
        if (addNA) 
            u <- addNA(u, ifany = TRUE)
        u[, drop = drop.unused.levels]
    })

Now we come to the crux. The na.action is used again to determine how the NA values in the response variable will be counted. In your case, since you passed na.action = NULL, you will see that naAct will get the value stored in getOption("na.action"), which if you have never changed it, should be set to na.omit. This in turn will cause the value of the variable na.rm, to be TRUE:

    naAct <- if (!is.null(m$na.action)) {
        m$na.action
    }else {getOption("na.action", default = quote(na.omit))}
    na.rm <- identical(naAct, quote(na.omit)) || identical(naAct, 
        na.omit) || identical(naAct, "na.omit")

Note that if you had passed na.action = na.pass, then na.rm would be FALSE if you trace this piece of code.

Finally, we come to the section where your xtabs table is built using sum inside a tapply, which is itself inside an lapply.

lapply(as.data.frame(y), tapply, by, sum, na.rm = na.rm, default = 0L)

You can see that the na.rm variable is used to determine whether to remove NAs from the columns before attempting to sum them. The result of this lapply is then coerced into the final cross tab.


So how does this answer your question?

It is true when the documentation says that if you don't pass an na.action, it will default to na.pass. However, the na.action is used in two places: once in the call to model.frame and once to determine the value of na.rm. It is very clear from the source code that if na.action is na.pass, then na.rm will be FALSE, so you will miss out on the counts of any response groups containing NA values. This is the opposite of what is written in the help file.

The only way round this is to pass na.action = NULL, since this will allow model.frame to keep NA values, but will also cause the sum function to default to na.rm.


TL;DR The documentation for xtabs is wrong on this point.

Davedaveda answered 26/4, 2020 at 23:7 Comment(3)
Thank you very much for this excellent, clear, and detailed answer! I am waiting for the deadline before awarding the bounty, but so far you have won it! I am going to report a bug to improve the documentation.Baylor
@Baylor thank you. I think it would be better if they changed the function to match the documentation rather than the other way around?Davedaveda
I agree. Here is the bug report: bugs.r-project.org/bugzilla/show_bug.cgi?id=17770. Do not hesitate to step in!Baylor
B
2

I'm sorry I'm only joining now. Indeed, the last half dozen changes to xtabs() were all by me, so I have to take responsibility here, too.

Delving into all variants and ramifications of xtabs() always takes some time which I've not taken yet (this time; of course did back then..).

But you finally deserve an answer:

  • Yes, there's bug -- either in the R code or in the documentation (and that "or" is inclusive .. ;-)

  • my current gut feeling is pointing to a bug in the help (file) rather than the implementation

  • R's bugzilla is the place we should get into details about this, not the least because that's "wired" to the R Core team's channels.

  • --> follow up there: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17770.

Benzophenone answered 13/6, 2020 at 14:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.