I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:
> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
var1 var2 var3
1 a right.a correct.a
2 a <NA> correct.a
3 b right.b correct.b
4 b <NA> correct.b
5 c right.c correct.c
6 c <NA> correct.c
7 d right.d correct.d
8 d <NA> correct.d
"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:
var1 var2 var3
1 a right.a correct.a
2 b right.b correct.b
3 c right.c correct.c
4 d right.d correct.d
However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:
> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
var1 var2 var3
1 e <NA> correct.e
2 e <NA> correct.e
3 e right.e <NA>
4 f <NA> correct.f
5 f <NA> <NA>
6 g <NA> correct.g
7 g right.g wrong.g
8 g <NA> <NA>
And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.
What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?
Thanks!
df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))
the code unfortunately just takes the first "a" and not the second. – Sophocles