Aggregating based on "near" row values

Asked 29/3, 2017 at 17:45 Answered 29/3, 2017 at 18:12

I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g   wrong.g
8    g    <NA>      <NA>

And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.

What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?

Thanks!

Sophocles answered 29/3, 2017 at 17:45 Comment(0)

Here is a method using data.table

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

and

setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g

The idea in var2[!is.na(var2)][1] for example, to take the first non-missing value from var2. If all values are missing, then this returns NA. This operation is performed for both variables by var1.

If you have more than two variables, you might switch to lapply. For example, the following.

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

In an instance where more than one var1 has a valid value, and this is indicated by a non-missing var2, then you can reach the intended result with a join.

The data from the comment,

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

Then, with this data,

setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d

Here, all non-missing var2 observations by var1 are merged onto the original data set.

Evangelia answered 29/3, 2017 at 18:12 Comment(4)

Thank you very much, this looks promising. Any idea on how to deal with the duplicates in the dataframe (i.e. observations with the same ID grouped at random positions in the dataframe)? If df1 looks like this:

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

the code unfortunately just takes the first "a" and not the second. – Sophocles 29/3, 2017 at 19:35

See the additional text at the end of the answer. – Evangelia 29/3, 2017 at 19:48

Your last line of code including the join throws me this error: Error in [.data.table

(setDT(df1), df1[, .(var2 = var2[!is.na(var2)]),  :    'on' argument should be a named atomic vector oc column names indicating which columns in 'i' should be joined with which columns in 'x'.

– Sophocles 29/3, 2017 at 20:12

Just tried it again with a fresh session of R and it worked. You probably have to update the version of data.table you are using. I'm using 1.10.4. If that is not an option, you might try replacing on=.(var1, var2) with on=c("var1", "var2"). – Evangelia 29/3, 2017 at 20:15

If var2 and var3 have only one unique value for each level of var1, then:

library(dplyr)

df = rbind(df1,df2)

df %>% group_by(var1) %>%
  summarise_all(funs(.[!is.na(.)][1]))

   var1    var2      var3
1     a right.a correct.a
2     b right.b correct.b
3     c right.c correct.c
4     d right.d correct.d
5     e right.e correct.e
6     f    <NA> correct.f
7     g right.g correct.g

Algeciras answered 29/3, 2017 at 17:55 Comment(3)

Unfortunately, this takes the "right.g" into the f-row (as there is no "right.f" in my dataset). – Sophocles 29/3, 2017 at 17:57

Turns out the problem was related to a column type issue when all values in a group are NA (as is the case for var2 here when var1=="f"). I've fixed it by using NA_character_ instead of NA. – Algeciras 29/3, 2017 at 18:16

Updated to adapt @Imo's shorter code for dealing with groups that are all NA. – Algeciras 29/3, 2017 at 18:49

Recommended topics

Hot tags