R: Updating a data frame with another data frame
Asked Answered
M

5

5

Let's say our initial data frame looks like this:

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))

> df1
  Index A  B  C
1     1 1  1  1
2     2 2  2  2
3     3 3  3  3
4     4 4 NA NA
5     5 5 NA NA
6     6 6 NA NA

Another data frame contains new information for col B and C

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))

> df2
  Index B C
1     4 4 5
2     5 4 5
3     6 4 5

How can you update the missing values in df1 so it looks like this:

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

My attempt:

library(dplyr)

> full_join(df1,df2)
Joining by: c("Index", "B", "C")
  Index  A  B  C
1     1  1  1  1
2     2  2  2  2
3     3  3  3  3
4     4  4 NA NA
5     5  5 NA NA
6     6  6 NA NA
7     4 NA  4  5
8     5 NA  4  5
9     6 NA  4  5

Which as you can see has created duplicate rows for the 4,5,6 index instead of replacing the NA values.

Any help would be greatly appreciated!

Misfile answered 19/5, 2016 at 1:47 Comment(0)
S
6

merge then aggregate:

aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )

#  Index B C A
#1     1 1 1 1
#2     2 2 2 2
#3     3 3 3 3
#4     4 4 5 4
#5     5 4 5 5
#6     6 4 5 6

Or in dplyr speak:

df1 %>% 
    full_join(df2) %>%
    group_by(Index) %>%
    summarise_each(funs(na.omit))

#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
#  Index     A     B     C
#  (dbl) (int) (dbl) (dbl)
#1     1     1     1     1
#2     2     2     2     2
#3     3     3     3     3
#4     4     4     4     5
#5     5     5     4     5
#6     6     6     4     5
Stoner answered 19/5, 2016 at 2:55 Comment(2)
Thanks, it worked great! I really need to learn more about the aggregate function. Is there also a way to update entries that aren't necessarily NA?Misfile
I tried the aggregate option where the master file (MF) had 9634 obs of 430 variables and the update file (UF) had 9249 obs of 4 identically named variables. The result file has 9224 obs of 430 variables (So considerable loss of masterfile data). The dplyr option (which was MUCH slower) [warning summarise_each() deprecated in dplyr 0.7.0] - and in fact never completed. So be warned you will lose NA data from your master file with this solution.Marnimarnia
G
5

We can use join from data.table. Convert the 'data.frame' to 'data.table' (setDT(df1), join on with 'df1' using "Index" and assign (:=), the values in 'B' and 'C' with 'i.B' and 'i.C'.

library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
#   Index A B C
#1:     1 1 1 1
#2:     2 2 2 2
#3:     3 3 3 3
#4:     4 4 4 5
#5:     5 5 4 5
#6:     6 6 4 5
Godfree answered 19/5, 2016 at 2:18 Comment(4)
Hi akrun, I couldn't get this solution to work for meMisfile
@Misfile My solution was based on the example you provided. It works for me with data.table 1.9.6. Also, as your comment was not very informative (i.e. not providing any error message or anything), I can't comment.Godfree
Error in [.data.table(setDT(df1), df2, :=(c("B", "C"), .(i.B, i.C)), : unused argument (on = "Index")Misfile
@Misfile It could be because you are using an older version of data.table. Please update to new versionGodfree
C
4

As of dplyr >= 1.0.0 you can use rows_update:

library(dplyr)

df1 %>% 
  rows_update(df2, by = "Index")

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

Alternatively, there is rows_patch:

rows_patch() works like rows_update() but only overwrites NA values.

Coo answered 11/5, 2022 at 16:54 Comment(1)
This only work if the both data frame has same columnsNumberless
M
3

For those interested, I've extended this problem to:

- handle updating a data frame with another data frame with new columns

- replace any existing entries regardless if they're NA or not.

Heres the solution I found using the aggregate function from @thelatemail :)

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))

df3 = full_join(df1,df2)

# Create a function na.omit.last 
na.omit.last = function(x){
  x <- na.omit(x)
  x <- last(x)
}

# For the columns not in df1 
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))] 
dfA = data.frame(lapply(dfA,as.numeric))

dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)

# If there are more columns in df2 append dfA
if (ncol(df2) > ncol(df1)) {
  df3 = cbind(dfB,dfA)
}  else {
    df3 = dfB
  }

print(df3)
Misfile answered 20/5, 2016 at 8:4 Comment(0)
O
2

Not sure what the general case or conditions would be, but this works for this instance without dplyr

df3 <- as.matrix(df1)
df3[which(is.na(df3))] <- as.matrix(df2)
df3 <- as.data.frame(df3)
df3

  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 5
5 5 4 5
6 6 4 5
Osrick answered 19/5, 2016 at 1:59 Comment(1)
thanks this could be useful - i've updated the question and added an index column thoughMisfile

© 2022 - 2024 — McMap. All rights reserved.