I have the following dataframes:
df1 = DataFrame(
col_A = [1, 2, 3, 4, 5, 6, 7],
col_B = ["A", "B", "C", "D", "E", "F", "G"],
col_C = missing,
)
7×3 DataFrame
Row │ col_A col_B col_C
│ Int64 String Missing
─────┼────────────────────────
1 │ 1 "A" missing
2 │ 2 "B" missing
3 │ 3 "C" missing
4 │ 4 "D" missing
5 │ 5 "E" missing
6 │ 6 "F" missing
7 │ 7 "G" missing
df2 = DataFrame(
col_X = [1, 2, 3, 4, 5, 5],
col_Y = ["A", "nope", "C", "nope", "E", "E"],
col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)
6×3 DataFrame
Row │ col_X col_Y col_Z
│ Int64 String String
─────┼───────────────────────────
1 │ 1 "A" "First"
2 │ 2 "nope" "Second"
3 │ 3 "C" "Third"
4 │ 4 "nope" "Fourth"
5 │ 5 "E" "Fifth"
6 │ 5 "E" "Duplicated"
I need to efficiently replace the values of df1.col_C
with those of df2.col_Z
if there's a match between -let's say- composite keys made from the first 2 columns in both dataframes (e.g. (1, "A")
occurs in both, but (2, "B")
doesn't), and otherwise keep things unchanged. If there were duplicated composite keys, get the last occurrence in df2
.
So df1
would become:
7×3 DataFrame
Row │ col_A col_B col_C
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 "A" "First"
2 │ 2 "B" missing
3 │ 3 "C" "Third"
4 │ 4 "D" missing
5 │ 5 "E" "Duplicated"
6 │ 6 "F" missing
7 │ 7 "G" missing