Collapsing rows where some are all NA, others are disjoint with some NAs

I have a simple data frame like this:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

And I would like to reformat it to this:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

STEP ONE: get rid of rows that have only NA values.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.
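
One workaround for STEP ONE (a minimal sketch, assuming ID is the first column) is to count non-NA cells per row rather than relying on complete.cases:

# keep any row that has at least one non-NA measurement
DF[rowSums(!is.na(DF[-1])) > 0, ]
#   ID Col1 Col2 Col3 Col4
# 2  1    5   10   NA   NA
# 3  1   NA   NA   15   20
# 5  2   25   30   NA   NA
# 6  2   NA   NA   35   40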

STEP TWO: collapse the remaining two rows into one.

I'm thinking about using something like aggregate to pull this off, but there has to be an easier way than what I've tried so far, which doesn't work at all.

Thanks for any advice.

Judon answered 20/1, 2015 at 1:22 Comment(4)
Is the data always 100% numeric? – Laetitia
related: blend of na.omit and na.pass using aggregate in r – Marquet
Is there always exactly one non-NA value per column per group or could there be more? – Wayworn
More! For better or worse – Judon
18

Here's a data.table approach that uses na.omit() across the columns, grouped by ID.

library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40
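
Note: as the comments below point out, this breaks down if a column is entirely NA for some ID, because na.omit() then returns a zero-length vector for that group. A hedged variant (a sketch that keeps the first non-NA value per column, or NA if there is none):

setDT(df)[, lapply(.SD, function(x) if (all(is.na(x))) x[1L] else na.omit(x)[1L]), by = ID]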
Laetitia answered 20/1, 2015 at 1:59 Comment(2)
na.omit is clever. Better than sum and will work on non-numeric data. I'm discreetly going to edit... – Marquet
This solution doesn't work if any column has all NAs for a certain ID. – Lianneliao
12

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE))) 

Edit: To account for the case in which a column has all NAs for a certain ID, we need a sum_NA() function that returns NA when all of its values are NA:

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
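
Note: funs() has since been deprecated; with dplyr 1.0 or later, the same idea can be written with across() (a sketch reusing the sum_NA() helper defined above):

DF %>%
  group_by(ID) %>%
  summarise(across(everything(), sum_NA))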
Mithridatism answered 20/1, 2015 at 1:33 Comment(3)
I think summarise_each(funs(na.omit(.))) also works – Dud
Yes, but I think that would yield a little dissonant result if any of the columns had more than 2 non-NA observations. – Mithridatism
This solution doesn't work if any column has all NAs for a certain ID. For example, replace 25 with NA for ID = 2 in Col1; funs(na.omit(.)) will then throw Error: Column Col1 must be length 1 (a summary value), not 0 – Happen
9

Here are a couple of aggregate approaches:

aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.

Alternatively, don't use the formula interface and specify the columns to aggregate manually:

aggregate(dat[-1], dat[1], FUN=na.omit)
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40
Marquet answered 20/1, 2015 at 1:32 Comment(0)
8

Since dplyr 1.0.0, you can also do (using the modified example data from the dplyr answer above):

df %>% 
 group_by(ID) %>%
 summarize(across(everything(), ~ first(na.omit(.))))

     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
Jemmy answered 19/11, 2020 at 10:59 Comment(0)
1

The simple way is:

as.data.frame(lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)x[!is.na(x)]))

but if not all columns have the same number of non-NA values then you'll need to trim them like so:

temp  <-  lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)x[!is.na(x)])
len  <-  min(sapply(temp,length))
as.data.frame(lapply(temp,`[`,seq(len)))
Grantham answered 20/1, 2015 at 1:32 Comment(0)
