Merge multiple data tables with duplicate column names
Asked Answered
B

7

13

I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but works fine when I merge only 4. MWE below:

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround, where, if I change the 2nd column name for DT1:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?

Basidium answered 11/9, 2015 at 15:21 Comment(15)
possible duplicate of #28379137Copeland
"Works fine"? Good luck with two y.ys, etc.Bema
It looks like it happens because the Reduce function can't give a new column name. In other words, you run out of combinations of x and y (there are only 4 possible combinations so the error occurs when working with a 5th column).Beaux
@Bema Yes, but I can use setnames() after I get a merged data table (I read the 5 files always in a particular order), so that's not an issue for me.Basidium
@bourbaki4481472 that makes senseBasidium
@Basidium - that's a bad idea - do the setnames before/during the join, not afterMayers
Can the x vector contain repeated values? Is it identical in all DTs?Bema
@Bema yes, x is exactly the same in all DTs. There are no repeated values.Basidium
@Mayers setnames() during join would be my choice too, but how can I do that?Basidium
@Basidium if x is exactly the same for all of them, you shouldn't be mergingMayers
@Mayers you're right, I can do cbind when x is exactly the same. I started using merge because I thought it would take care of any missing rows in some of the DTs, so yes, to clarify, x might not be the same in all DTs, but definitely has no repeated values. Sorry for the confusion!Basidium
@Bema see my comment above regarding x, sorry for mis-stating the caseBasidium
Okay, thanks for clarifying. Although it's best to make your question representative of the problem you're actually facing, I think you got lucky this time and all the answers happen to still work in that case, except maybe the first part of Jaap's.Bema
To be clear, you want multiple y columns with automatic names? Like y.1 y.2 y.3 y.4 y.5? And given you know that only column y is duplicated. So we don't need to auto-detect duplicate columns, and auto-rename/number them.Creator
data.table::merge() has an argument suffixes = c(".x", ".y"), so you can avoid the rename step, just pass the corresponding number suffix already.Creator
M
7

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19
Mayers answered 11/9, 2015 at 16:9 Comment(2)
What's up with the parentheses around and following the function definition, like (function()...)()?Bema
@Bema it's a closure the outer function creates an environment, and returns the inner function, which is what those parentheses extractMayers
A
9

If it's just those 5 datatables (where x is the same for all datatables), you could also use nested joins:

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

When your x columns do not have the same values, a nested join will not give the desired solution:

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20

Used data:

In order to make the code with Reduce work, I changed the names of the y columns.

DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")
Authoritative answered 11/9, 2015 at 15:40 Comment(6)
this is not the same as a merge with all=TRUEMayers
Nothing surprising about it working on the toy example. The two will diverge once you add values of x that are not all identical between all 5. merge with all = TRUE does an outer join, whereas [ does a one-sided one.Mayers
Your first one is aka Reduce(function(X,Y) X[Y], DTlist)Bema
@Mayers True, but the OP said in the comments that x is exactly the same for all DT's. I've updated my answer to show that with non-indetical x values this does not work.Authoritative
ok, though if x is exactly the same, doing a merge is sillyMayers
@Bema Thanx, added it to my answer.Authoritative
M
7

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19
Mayers answered 11/9, 2015 at 16:9 Comment(2)
What's up with the parentheses around and following the function definition, like (function()...)()?Bema
@Bema it's a closure the outer function creates an environment, and returns the inner function, which is what those parentheses extractMayers
B
5

stack and reshape I don't think this maps exactly to the merge function but...

mycols <- "x"
DTlist <- list(DT1,DT2,DT3,DT4,DT5)

dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))

#    x  1  2  3  4  5
# 1: a 10 11 12 13 14
# 2: b 11 12 13 14 15
# 3: c 12 13 14 15 16
# 4: d 13 14 15 16 17
# 5: e 14 15 16 17 18
# 6: f 15 16 17 18 19

I have no sense for if this would extend to having more columns than y.

merge-assign

DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
  lapply(DTlist,`[.noquote`,mycols))

for (k in seq_along(DTlist)){
  js = setdiff( names(DTlist[[k]]), mycols )
  DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
}

#    x y.1 y.2 y.3 y.4 y.5
# 1: a  10  11  12  13  14
# 2: b  11  12  13  14  15
# 3: c  12  13  14  15  16
# 4: d  13  14  15  16  17
# 5: e  14  15  16  17  18
# 6: f  15  16  17  18  19

(I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)

Bema answered 11/9, 2015 at 15:57 Comment(0)
I
3

Using reshaping gives you a lot more flexibility in how you want to name your columns.

library(dplyr)
library(tidyr)

list(DT1, DT2, DT3, DT4, DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste("y", source, sep = ".")) %>%
  spread(source, y)

Or, this would work

library(dplyr)
library(tidyr)

list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste(source, "y", sep = ".")) %>%
  spread(source, y)
Isolated answered 11/9, 2015 at 15:46 Comment(2)
After the bind_rows there is no source column, so I see the Error: cannot coerce type 'closure' to vector of type 'character' (since source is a function). Not sure the fix for this... presumably you're misusing bind_rows...?Bema
The .id feature is new in dplyr 0.4.3. Is that the version you are using?Isolated
T
2

Another way of doing this:

dts <- list(DT1, DT2, DT3, DT4, DT5)

names(dts) <- paste("y", seq_along(dts), sep="")
data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")

#   x y1 y2 y3 y4 y5
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19

The package name in "data.table::dcast" is added to ensure that the call returns a data table and not a data frame even if the "reshape2" package is loaded as well. Without mentioning the package name explicitly, the dcast function from the reshape2 package might be used which works on a data.frame and returns a data.frame instead of a data.table.

Terresaterrestrial answered 3/3, 2016 at 10:35 Comment(0)
S
1

Alternatively you could setNames for the columns before and do merge like this

dts = list(DT1, DT2, DT3, DT4, DT5)
names(dts) = paste('DT', c(1:5), sep = '')    

dtlist = lapply(names(dts),function(i) 
         setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))

Reduce(function(...) merge(..., all = T), dtlist)

#   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
#1: a    10    11    12    13    14
#2: b    11    12    13    14    15
#3: c    12    13    14    15    16
#4: d    13    14    15    16    17
#5: e    14    15    16    17    18
#6: f    15    16    17    18    19
Shul answered 11/9, 2015 at 16:14 Comment(5)
Fyi, it is not necessary to give names to dts; you can already refer to them by 1:5. Also, you probably want a for loop with setnames, not setNames (mentioned by eddi in a comment on the q and used by the op).Bema
@Bema yes thanks! I just thought to include that step thinking if OP may want to put data.table names instead of just numbers, in that way the ultimate data.table would be more informative rightShul
@Bema I didn't understand why for loop?Shul
The data.table function setnames operates by reference, modifying the object itself, so there is no need to assign the function's value/result.Bema
@Bema Oh, that's when I am using data.table's setnames right.. OK cool!Shul
K
0

This is an alternative solution - you can define join columns each time (when your x columns do not have the same values) . You need to define vectors with column names. Then you may chain joining by reference like this:

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]
Kimikokimitri answered 23/1, 2023 at 9:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.