Count total missing values by group?
Asked Answered
M

3

6

EDIT: input

very new to this.

I have a similar problem to this: group by and then count missing variables?

Taking the input data from that question:

df1 <- data.frame(
  Z = sample(LETTERS[1:5], size = 10000, replace = T),
  X1 = sample(c(1:10,NA), 10000, replace = T),
  X2 = sample(c(1:25,NA), 10000, replace = T),
  X3 = sample(c(1:5,NA), 10000, replace = T))

as one user proposed, it's possible to use summarise_each:

df1 %>% 
  group_by(Z) %>% 
  summarise_each(funs(sum(is.na(.))))
#Source: local data frame [5 x 4]
#
#       Z    X1    X2    X3
#  (fctr) (int) (int) (int)
#1      A   169    77   334
#2      B   170    77   316
#3      C   159    78   348
#4      D   181    79   326
#5      E   174    69   341  

However, I would like to get only the total number of missing values per group.

I've also tried this but it didn't work: R count NA by group

Ideally, it should give me something like:

#       Z    sumNA 
#  (fctr)   (int) 
#1      A    580
#2      B    493
#3      C    585
#4      D    586
#5      E    584  

Thanks in advance.

Motorist answered 7/11, 2018 at 18:53 Comment(3)
Welcome to Stackoverflow! What's your input data for a given result?Skilled
Can you post sample data? Please edit the question with the output of dput(df). Or, if it is too big with the output of dput(head(df, 20)). (df is the name of your dataset.)Bugle
Using the data in the first linked question, maybe group_by(df1, Z) %>% summarize(n = sum(is.na(X1)))? Those aren't the numbers you show here, but that may be due to the uncontrolled randomness (should have used set.seed).Taber
A
2

data.table solution

library(data.table)
setDT(df1)

df1[, .(sumNA = sum(is.na(.SD))), by = Z]

#    Z sumNA
# 1: A   559
# 2: C   661
# 3: E   596
# 4: B   597
# 5: D   560

dplyr solution using rowSums(.[-1]), i.e. row-sums for all columns except the first.

library(dplyr)

df1 %>% 
  group_by(Z) %>% 
  summarise_all(~sum(is.na(.))) %>% 
  transmute(Z, sumNA = rowSums(.[-1]))

# # A tibble: 5 x 2
#   Z     sumNA
#   <fct> <dbl>
# 1 A       559
# 2 B       597
# 3 C       661
# 4 D       560
# 5 E       596
Accentor answered 7/11, 2018 at 19:46 Comment(1)
Great data.table way! You can also add a new column to original dataset: setDT(df1)[, sumNA := sum(is.na(.SD)), by = Z]Ballad
L
8

You can use the tidyverse approach.

require(tidyverse)
#Sample data
dat <- data.frame(group = rep(c("a", "b", "c", "d", "g"), 3), 
                  y = rep(c(1, NA, 2, NA, 3), 3))


dat %>% 
  group_by(group) %>% 
  summarise(sumNA = sum(is.na(y)))

Output:

  group sumNA
  <fct> <int>
1 a         0
2 b         3
3 c         0
4 d         3
5 g         0

Edit

However, if you have more than one column, you can use summarize_all (or summarize_at if you'd like to specify the columns; thank you @ bschneidr for the comment):

#Sample data
set.seed(123)
dat <- data.frame(group = sample(letters[1:4], 10, replace = T), 
                  x = sample(c(1,NA), 10, replace = T), 
                  y = sample(c(1,NA), 10, replace = T), 
                  z = sample(c(1, NA), 10, replace = T))

dat %>% 
  group_by(group) %>% 
  summarize_all(.funs = funs('NA' = sum(is.na(.))))

# A tibble: 4 x 4
  group  x_NA  y_NA  z_NA
  <fct> <int> <int> <int>
1 a         1     1     0
2 b         3     2     2
3 c         0     1     1
4 d         1     4     2
Linkous answered 7/11, 2018 at 19:3 Comment(2)
What if I have more variables? I could type all the variables in sum(is.na(y+z+...)). But is there a way to do this without having to type every single variable?Motorist
You could use summarize_at from the dplyr package. For example, this would give you missing counts for two variables, X and Y: data %>% group_by(group) %>% summarize_at(.vars = vars(X, Y), .funs = funs('NA' = sum(is.na(.))))Coverage
G
2

If your data looks like the linked post:

df1 <- data.frame(
  Z = as.factor(sample(LETTERS[1:5], size = 10000, replace = T)),
  X1 = sample(c(1:10,NA), 10000, replace = T),
  X2 = sample(c(1:25,NA), 10000, replace = T),
  X3 = sample(c(1:5,NA), 10000, replace = T)
)

You can do the following in base R:

res <- sapply(split(df1[-1], f = df1$Z), function(x) colSums(is.na(x)))
print(res)
#     A   B   C   D   E
#X1 193 180 199 170 183
#X2  74  68  79  90  87
#X3 350 349 340 336 328

If you absolutely need it transposed, you can call t(res):

print(t(res))
#   X1 X2  X3
#A 193 74 350
#B 180 68 349
#C 199 79 340
#D 170 90 336
#E 183 87 328

Edit: If you want the sum of all NAs and not within each variable the following small modification of the above works:

res2 <- sapply(split(df1[-1], f = df1$Z), function(x) sum(is.na(x)))
print(res2)
#  A   B   C   D   E 
#589 588 569 646 598 

Alternatively, colSums(res) would give you the same. Again, t() if needed as a column.

Gardia answered 7/11, 2018 at 19:15 Comment(2)
Yes, but I don't want it split into X1, X2, X3. Just the total missing values per group.Motorist
@J.Ziegler So the sum of the NA's in all three columns for each level in Z? I've edited my answer.Gardia
A
2

data.table solution

library(data.table)
setDT(df1)

df1[, .(sumNA = sum(is.na(.SD))), by = Z]

#    Z sumNA
# 1: A   559
# 2: C   661
# 3: E   596
# 4: B   597
# 5: D   560

dplyr solution using rowSums(.[-1]), i.e. row-sums for all columns except the first.

library(dplyr)

df1 %>% 
  group_by(Z) %>% 
  summarise_all(~sum(is.na(.))) %>% 
  transmute(Z, sumNA = rowSums(.[-1]))

# # A tibble: 5 x 2
#   Z     sumNA
#   <fct> <dbl>
# 1 A       559
# 2 B       597
# 3 C       661
# 4 D       560
# 5 E       596
Accentor answered 7/11, 2018 at 19:46 Comment(1)
Great data.table way! You can also add a new column to original dataset: setDT(df1)[, sumNA := sum(is.na(.SD)), by = Z]Ballad

© 2022 - 2024 — McMap. All rights reserved.