Proper idiom for adding zero count rows in tidyr/dplyr
Asked Answered
M

6

51

Suppose I have some count data that looks like this:

library(tidyr)
library(dplyr)

X.raw <- data.frame(
  x = as.factor(c("A", "A", "A", "B", "B", "B")),
  y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)
X.raw
#   x  y z
# 1 A  i 1
# 2 A ii 2
# 3 A ii 3
# 4 B  i 4
# 5 B  i 5
# 6 B  i 6

I'd like to tidy and summarise like this:

X.tidy <- X.raw %>% group_by(x, y) %>% summarise(count = sum(z))
X.tidy
# Source: local data frame [3 x 3]
# Groups: x
#
#   x  y count
# 1 A  i     1
# 2 A ii     5
# 3 B  i    15

I know that for x=="B" and y=="ii" we have observed count of zero, rather than a missing value. i.e. the field worker was actually there, but because there wasn't a positive count no row was entered into the raw data. I can add the zero count explicitly by doing this:

X.fill <- X.tidy %>% spread(y, count, fill = 0) %>% gather(y, count, -x)
X.fill
# Source: local data frame [4 x 3]
# 
#   x  y count
# 1 A  i     1
# 2 B  i    15
# 3 A ii     5
# 4 B ii     0

But that seems a little bit of a roundabout way of doing things. Is there a cleaner idiom for this?

Just to clarify: My code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.

Mame answered 21/9, 2014 at 5:39 Comment(4)
If you know that for "B"/"ii" there is an observed count of zero, why isn't that in your source dataset? Your source dataset presently indicates that combination as missing.Spinster
The raw dataset only includes positive counts. But we know all combinations were surveyed.Mame
Do you have another table that shows what combinations of "x" and "y" are all valid? Otherwise, how do you plan to distinguish between 0 and NA?Spinster
There aren't any NAs.Mame
C
46

Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:

X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups:   x [2]
#   x     y     count
#   <fct> <fct> <int>
# 1 A     i         1
# 2 A     ii        5
# 3 B     i        15
# 4 B     ii        0

This will keep groups made of all the levels of factor columns so if you have character columns you might want to convert them (thanks to Pake for the note).

Confrere answered 20/2, 2019 at 20:41 Comment(2)
just a hint if you made it here and this isn't working for you: be sure you're using factors and not character columns in your group_by()Condescend
@Condescend big thanks for this comment! I weas going mad trying to work out why this didn't work.Mercado
C
33

The complete function from tidyr is made for just this situation.

From the docs:

This is a wrapper around expand(), left_join() and replace_na that's useful for completing missing combinations of data.

You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y, and filling z with 0 (you could use the default NA fill and use na.rm = TRUE in sum).

X.raw %>% 
    complete(x, y, fill = list(z = 0)) %>% 
    group_by(x,y) %>% 
    summarise(count = sum(z))

Source: local data frame [4 x 3]
Groups: x [?]

       x      y count
  <fctr> <fctr> <dbl>
1      A      i     1
2      A     ii     5
3      B      i    15
4      B     ii     0

You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.

# Complete after ungrouping
X.tidy %>% 
    ungroup %>%
    complete(x, y, fill = list(count = 0))

# Complete within grouping
X.tidy %>% 
    complete(y, fill = list(count = 0))

The result is the same for each option:

Source: local data frame [4 x 3]

       x      y count
  <fctr> <fctr> <dbl>
1      A      i     1
2      A     ii     5
3      B      i    15
4      B     ii     0
Chiromancy answered 9/6, 2016 at 13:57 Comment(0)
J
4

You can use tidyr's expand to make all combinations of levels of factors, and then left_join:

X.tidy %>% expand(x, y) %>% left_join(X.tidy)

# Joining by: c("x", "y")
# Source: local data frame [4 x 3]
# 
#   x  y count
# 1 A  i     1
# 2 A ii     5
# 3 B  i    15
# 4 B ii    NA

Then you may keep values as NAs or replace them with 0 or any other value. That way isn't a complete solution of the problem too, but it's faster and more RAM-friendly than spread & gather.

Judah answered 1/7, 2015 at 13:17 Comment(2)
Also see complete in the development version of tidyr (0.2.0.9000), which is a handy wrapper for expand, left_join, and replace_na.Chiromancy
Thanks @Chiromancy that's exactly what I needed. If you write it up as an answer I'll accept it.Mame
H
3

plyr has the functionality you're looking for, but dplyr doesn't (yet), so you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop=FALSE to keep zero-count groups in the final result. For example:

library(plyr)

X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)

X.tidy
  x  y count
1 A  i     1
2 A ii     5
3 B  i    15
4 B ii     0
Henleigh answered 21/9, 2014 at 6:18 Comment(1)
do you mean ddply instead of dplyr in the second line?Diadromous
D
2

You could explicitly make all possible combinations and then joining it with the tidy summary:

x.fill <- expand.grid(x=unique(x.tidy$x), x=unique(x.tidy$y)) %>%
    left_join(x.tidy, by=("x", "y")) %>%
    mutate(count = ifelse(is.na(count), 0, count)) # replace null values with 0's
Diadromous answered 21/9, 2014 at 6:16 Comment(0)
A
0

You can also use the data.table package and its Cross Join CJ() function for that.

require(data.table)

X = data.table(X.raw)[
  CJ(y = y,
     x = x,
     unique = TRUE), 
  on = .(x, y)
  ][ , .(z = sum(z)), .(x, y) ][ order(x, y) ]
X

# filling the NAs with 0s
setnafill(X, fill = 0, cols = 'z')
X
#    x  y  z
# 1: A  i  1
# 2: A ii  5
# 3: B  i 15
# 4: B ii  0

Though it's not initially asked for, I'm adding a data.table solution here for the sake of completeness and to also link to the related data.table question.

Adebayo answered 13/7, 2022 at 13:51 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.