dplyr cross tab with missing values
Asked Answered
M

3

9

I would like to make a cross tab in R using dplyr. I have good reasons for not just using the base table() command.

table(mtcars$cyl, mtcars$gear)
     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2

library(dplyr)
library(tidyr)
mtcars %>% 
  group_by(cyl, gear) %>%
  tally() %>%
  spread(gear, n, fill = 0)
Source: local data frame [3 x 4]

  cyl  3 4 5
1   4  1 8 2
2   6  2 4 1
3   8 12 0 2

This is all well and good. But it seems to fall apart when there are missing values in the group_by() variables.

mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  group_by(cyl, gear) %>%
  tally()
Source: local data frame [8 x 3]
Groups: cyl

  cyl gear  n
1   4    3  1
2   4    4  8
3   4   NA  2
4   6    3  2
5   6    4  4
6   6   NA  1
7  NA    3 12
8  NA   NA  2

# DITTO # %>%
  spread(gear, n)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

I guess what I would like is for a NA column like when you do table(..., useNA = "always"). Any tips?

Melioration answered 17/3, 2015 at 19:52 Comment(5)
Could you just replace the NAs with meaningful labels, like "cyl.NA".Sequence
I suppose, but cross tabs are usually exploratory in and of themselves. Exploring how you should explore seems tedious. But maybe necessary.Melioration
File a minimal reproducible bug report on the tidyr repo?Premolar
Done, filed issue #68Melioration
@Melioration - do you have any updates on this one? I was wondering if anything came of the bug reportProffitt
S
9

One option is to replace the NAs with a label. This can be accomplished easily with mutate_each:

mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  group_by(cyl, gear) %>%
  tally() %>%
  ungroup() %>%
  mutate_each(funs(replace(., is.na(.), 'missing'))) %>%
  spread(gear, n)

#       cyl  3  4 missing
# 1       4  1  8       2
# 2       6  2  4       1
# 3 missing 12 NA       2
Sequence answered 17/3, 2015 at 20:36 Comment(1)
I like this and will go with it for now, but I think a more permanent solution should come through a tidyr bug fix.Melioration
T
2

Agreed that the permanent solution to this should be a tidyr bug fix, but in the meantime, this can be worked around by dropping the dplyr tbl_df format:

mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  group_by(cyl, gear) %>%
  tally() %>%
  data.frame() %>% ### <-- go from tbl_df to data.frame
  spread(gear, n)

  cyl  3  4 NA
1   4  1  8  2
2   6  2  4  1
3  NA 12 NA  2

The addition of the data.frame() call allows your code to run, though it produces a column named NA so this is probably best suited for exploratory analyses that print to the console.

Tuning answered 26/8, 2015 at 13:35 Comment(0)
C
2

Here's an updated answer that works with current dplyr (1.1.0) and tidyr (1.3.0) in 2023.

library(tidyr); library(dplyr)
mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  count(cyl, gear) %>%
  mutate(across(everything(), ~coalesce(as.character(.), "missing"))) %>%
  pivot_wider(names_from = gear, values_from = n)

# A tibble: 3 × 4
  cyl     `3`   `4`   missing
  <chr>   <chr> <chr> <chr>  
1 4       1     8     2      
2 6       2     4     1      
3 missing 12    NA    2 
Chanukah answered 16/2, 2023 at 15:49 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.