Asked 7/11, 2018 at 18:53 Answered 7/11, 2018 at 19:46

EDIT: input

very new to this.

I have a similar problem to this: group by and then count missing variables?

Taking the input data from that question:

df1 <- data.frame(
  Z = sample(LETTERS[1:5], size = 10000, replace = T),
  X1 = sample(c(1:10,NA), 10000, replace = T),
  X2 = sample(c(1:25,NA), 10000, replace = T),
  X3 = sample(c(1:5,NA), 10000, replace = T))

as one user proposed, it's possible to use summarise_each:

df1 %>% 
  group_by(Z) %>% 
  summarise_each(funs(sum(is.na(.))))
#Source: local data frame [5 x 4]
#
#       Z    X1    X2    X3
#  (fctr) (int) (int) (int)
#1      A   169    77   334
#2      B   170    77   316
#3      C   159    78   348
#4      D   181    79   326
#5      E   174    69   341

However, I would like to get only the total number of missing values per group.

I've also tried this but it didn't work: R count NA by group

Ideally, it should give me something like:

#       Z    sumNA 
#  (fctr)   (int) 
#1      A    580
#2      B    493
#3      C    585
#4      D    586
#5      E    584

Thanks in advance.

Motorist answered 7/11, 2018 at 18:53 Comment(3)

Welcome to Stackoverflow! What's your input data for a given result? – Skilled 7/11, 2018 at 18:56

Can you post sample data? Please edit the question with the output of dput(df). Or, if it is too big with the output of dput(head(df, 20)). (df is the name of your dataset.) – Bugle 7/11, 2018 at 18:59

Using the data in the first linked question, maybe group_by(df1, Z) %>% summarize(n = sum(is.na(X1)))? Those aren't the numbers you show here, but that may be due to the uncontrolled randomness (should have used set.seed). – Taber 7/11, 2018 at 19:1

data.table solution

library(data.table)
setDT(df1)

df1[, .(sumNA = sum(is.na(.SD))), by = Z]

#    Z sumNA
# 1: A   559
# 2: C   661
# 3: E   596
# 4: B   597
# 5: D   560

dplyr solution using rowSums(.[-1]), i.e. row-sums for all columns except the first.

library(dplyr)

df1 %>% 
  group_by(Z) %>% 
  summarise_all(~sum(is.na(.))) %>% 
  transmute(Z, sumNA = rowSums(.[-1]))

# # A tibble: 5 x 2
#   Z     sumNA
#   <fct> <dbl>
# 1 A       559
# 2 B       597
# 3 C       661
# 4 D       560
# 5 E       596

Accentor answered 7/11, 2018 at 19:46 Comment(1)

Great data.table way! You can also add a new column to original dataset: setDT(df1)[, sumNA := sum(is.na(.SD)), by = Z] – Ballad 9/4, 2020 at 15:9

You can use the tidyverse approach.

require(tidyverse)
#Sample data
dat <- data.frame(group = rep(c("a", "b", "c", "d", "g"), 3), 
                  y = rep(c(1, NA, 2, NA, 3), 3))


dat %>% 
  group_by(group) %>% 
  summarise(sumNA = sum(is.na(y)))

Output:

  group sumNA
  <fct> <int>
1 a         0
2 b         3
3 c         0
4 d         3
5 g         0

Edit

However, if you have more than one column, you can use summarize_all (or summarize_at if you'd like to specify the columns; thank you @ bschneidr for the comment):

#Sample data
set.seed(123)
dat <- data.frame(group = sample(letters[1:4], 10, replace = T), 
                  x = sample(c(1,NA), 10, replace = T), 
                  y = sample(c(1,NA), 10, replace = T), 
                  z = sample(c(1, NA), 10, replace = T))

dat %>% 
  group_by(group) %>% 
  summarize_all(.funs = funs('NA' = sum(is.na(.))))

# A tibble: 4 x 4
  group  x_NA  y_NA  z_NA
  <fct> <int> <int> <int>
1 a         1     1     0
2 b         3     2     2
3 c         0     1     1
4 d         1     4     2

Linkous answered 7/11, 2018 at 19:3 Comment(2)

What if I have more variables? I could type all the variables in sum(is.na(y+z+...)). But is there a way to do this without having to type every single variable? – Motorist 7/11, 2018 at 19:25

You could use summarize_at from the dplyr package. For example, this would give you missing counts for two variables, X and Y: data %>% group_by(group) %>% summarize_at(.vars = vars(X, Y), .funs = funs('NA' = sum(is.na(.)))) – Coverage 7/11, 2018 at 19:27

If your data looks like the linked post:

df1 <- data.frame(
  Z = as.factor(sample(LETTERS[1:5], size = 10000, replace = T)),
  X1 = sample(c(1:10,NA), 10000, replace = T),
  X2 = sample(c(1:25,NA), 10000, replace = T),
  X3 = sample(c(1:5,NA), 10000, replace = T)
)

You can do the following in base R:

res <- sapply(split(df1[-1], f = df1$Z), function(x) colSums(is.na(x)))
print(res)
#     A   B   C   D   E
#X1 193 180 199 170 183
#X2  74  68  79  90  87
#X3 350 349 340 336 328

If you absolutely need it transposed, you can call t(res):

print(t(res))
#   X1 X2  X3
#A 193 74 350
#B 180 68 349
#C 199 79 340
#D 170 90 336
#E 183 87 328

Edit: If you want the sum of all NAs and not within each variable the following small modification of the above works:

res2 <- sapply(split(df1[-1], f = df1$Z), function(x) sum(is.na(x)))
print(res2)
#  A   B   C   D   E 
#589 588 569 646 598

Alternatively, colSums(res) would give you the same. Again, t() if needed as a column.

Gardia answered 7/11, 2018 at 19:15 Comment(2)

Yes, but I don't want it split into X1, X2, X3. Just the total missing values per group. – Motorist 7/11, 2018 at 19:28

@J.Ziegler So the sum of the NA's in all three columns for each level in Z? I've edited my answer. – Gardia 7/11, 2018 at 19:35

data.table solution

library(data.table)
setDT(df1)

df1[, .(sumNA = sum(is.na(.SD))), by = Z]

#    Z sumNA
# 1: A   559
# 2: C   661
# 3: E   596
# 4: B   597
# 5: D   560

dplyr solution using rowSums(.[-1]), i.e. row-sums for all columns except the first.

library(dplyr)

df1 %>% 
  group_by(Z) %>% 
  summarise_all(~sum(is.na(.))) %>% 
  transmute(Z, sumNA = rowSums(.[-1]))

# # A tibble: 5 x 2
#   Z     sumNA
#   <fct> <dbl>
# 1 A       559
# 2 B       597
# 3 C       661
# 4 D       560
# 5 E       596

Accentor answered 7/11, 2018 at 19:46 Comment(1)

Great data.table way! You can also add a new column to original dataset: setDT(df1)[, sumNA := sum(is.na(.SD)), by = Z] – Ballad 9/4, 2020 at 15:9

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Edit

Recommended topics

Hot tags