Merge rows in a dataframe where the rows are disjoint and contain NAs

Asked 10/1, 2013 at 22:58 Answered 26/9, 2018 at 18:45

Solved r dataframe dplyr data.table coalesce

I have a dataframe that has two rows:

| code | name  | v1 | v2 | v3 | v4 |
|------|-------|----|----|----|----|
| 345  | Yemen | NA | 2  | 3  | NA |
| 346  | Yemen | 4  | NA | NA | 5  |

Is there an easy way to merge these two rows? What if I rename "345" in "346", would that make things easier?

Bratton answered 10/1, 2013 at 22:58 Comment(2)

You will need some rule for combining non-NA column. Such as will you always take the first occcurence or last, the mean of numeric columns etc. – Endocarditis 10/1, 2013 at 23:4

The coalesce() functionality is needed here . Found a good discussion on this thread : [link]#19254320 – Pyroconductivity 5/3, 2015 at 11:56

You can use aggregate. Assuming that you want to merge rows with identical values in column name:

aggregate(x=DF[c("v1","v2","v3","v4")], by=list(name=DF$name), min, na.rm = TRUE)
   name v1 v2 v3 v4
1 Yemen  4  2  3  5

This is like the SQL SELECT name, min(v1) GROUP BY name. The min function is arbitrary, you could also use max or mean, all of them return the non-NA value from an NA and a non-NA value if na.rm = TRUE. (An SQL-like coalesce() function would sound better if existed in R.)

However, you should check first if all non-NA values for a given name is identical. For example, run the aggregate both with min and max and compare, or run it with range.

Finally, if you have many more variables than just v1-4, you could use DF[,!(names(DF) %in% c("code","name"))] to define the columns.

Xeroderma answered 11/1, 2013 at 17:44 Comment(2)

running your example gives me Error in DF$name : $ operator is invalid for atomic vectors – Moynihan 20/1, 2015 at 1:17

@MattO'Brien how does your DF look like? Is it a data frame? Does if have multiple columns? Do you have code to replicate the error? – Xeroderma 20/1, 2015 at 10:24

Adding dplyr & data.table solutions for completeness

Using dplyr::coalesce()

library(dplyr)

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

df %>% 
  group_by(name) %>% 
  summarise_all(sum_NA)
#> # A tibble: 1 x 6
#>   name   code    v1    v2    v3    v4
#>   <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Yemen   691     4     2     3     5

# Ref: https://stackoverflow.com/a/45515491
# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
  return(dplyr::coalesce(!!! as.list(df)))
}

df %>% 
  group_by(name) %>% 
  summarise_all(coalesce_by_column)
#> # A tibble: 1 x 6
#>   name   code    v1    v2    v3    v4
#>   <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Yemen   345     4     2     3     5

Using data.table

# Ref: https://mcmap.net/q/152312/-collapsing-rows-where-some-are-all-na-others-are-disjoint-with-some-nas/
library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = name]
#>     name code v1 v2 v3 v4
#> 1: Yemen  345  4  2  3  5
#> 2: Yemen  346  4  2  3  5

setDT(df)[, code := NULL][, lapply(.SD, na.omit), by = name]    
#>     name v1 v2 v3 v4
#> 1: Yemen  4  2  3  5

setDT(df)[, code := NULL][, lapply(.SD, sum_NA), by = name]
#>     name v1 v2 v3 v4
#> 1: Yemen  4  2  3  5

Harpist answered 26/9, 2018 at 18:45 Comment(0)

Recommended topics

Hot tags