Merge multiple data tables with duplicate column names

Asked 11/9, 2015 at 15:21 Answered 23/1, 2023 at 9:28

Solved r join merge duplicates data.table

I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but works fine when I merge only 4. MWE below:

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround, where, if I change the 2nd column name for DT1:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?

Basidium answered 11/9, 2015 at 15:21 Comment(15)

possible duplicate of #28379137 – Copeland 11/9, 2015 at 15:31

"Works fine"? Good luck with two y.ys, etc. – Bema 11/9, 2015 at 15:32

It looks like it happens because the Reduce function can't give a new column name. In other words, you run out of combinations of x and y (there are only 4 possible combinations so the error occurs when working with a 5th column). – Beaux 11/9, 2015 at 15:34

@Bema Yes, but I can use setnames() after I get a merged data table (I read the 5 files always in a particular order), so that's not an issue for me. – Basidium 11/9, 2015 at 15:37

@bourbaki4481472 that makes sense – Basidium 11/9, 2015 at 15:39

@Basidium - that's a bad idea - do the setnames before/during the join, not after – Mayers 11/9, 2015 at 15:39

Can the x vector contain repeated values? Is it identical in all DTs? – Bema 11/9, 2015 at 15:39

@Bema yes, x is exactly the same in all DTs. There are no repeated values. – Basidium 11/9, 2015 at 15:42

@Mayers setnames() during join would be my choice too, but how can I do that? – Basidium 11/9, 2015 at 15:44

@Basidium if x is exactly the same for all of them, you shouldn't be merging – Mayers 11/9, 2015 at 16:17

@Mayers you're right, I can do cbind when x is exactly the same. I started using merge because I thought it would take care of any missing rows in some of the DTs, so yes, to clarify, x might not be the same in all DTs, but definitely has no repeated values. Sorry for the confusion! – Basidium 11/9, 2015 at 16:34

@Bema see my comment above regarding x, sorry for mis-stating the case – Basidium 11/9, 2015 at 16:35

Okay, thanks for clarifying. Although it's best to make your question representative of the problem you're actually facing, I think you got lucky this time and all the answers happen to still work in that case, except maybe the first part of Jaap's. – Bema 11/9, 2015 at 16:40

To be clear, you want multiple y columns with automatic names? Like y.1 y.2 y.3 y.4 y.5? And given you know that only column y is duplicated. So we don't need to auto-detect duplicate columns, and auto-rename/number them. – Creator 4/5, 2018 at 1:18

data.table::merge() has an argument suffixes = c(".x", ".y"), so you can avoid the rename step, just pass the corresponding number suffix already. – Creator 4/5, 2018 at 1:21

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19

Mayers answered 11/9, 2015 at 16:9 Comment(2)

What's up with the parentheses around and following the function definition, like (function()...)()? – Bema 11/9, 2015 at 16:27

@Bema it's a closure the outer function creates an environment, and returns the inner function, which is what those parentheses extract – Mayers 11/9, 2015 at 16:35

If it's just those 5 datatables (where x is the same for all datatables), you could also use nested joins:

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

When your x columns do not have the same values, a nested join will not give the desired solution:

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20

Used data:

In order to make the code with Reduce work, I changed the names of the y columns.

DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")

Authoritative answered 11/9, 2015 at 15:40 Comment(6)

this is not the same as a merge with all=TRUE – Mayers 11/9, 2015 at 15:43

Nothing surprising about it working on the toy example. The two will diverge once you add values of x that are not all identical between all 5. merge with all = TRUE does an outer join, whereas [ does a one-sided one. – Mayers 11/9, 2015 at 15:57

Your first one is aka Reduce(function(X,Y) X[Y], DTlist) – Bema 11/9, 2015 at 16:10

@Mayers True, but the OP said in the comments that x is exactly the same for all DT's. I've updated my answer to show that with non-indetical x values this does not work. – Authoritative 11/9, 2015 at 16:14

ok, though if x is exactly the same, doing a merge is silly – Mayers 11/9, 2015 at 16:16

@Bema Thanx, added it to my answer. – Authoritative 11/9, 2015 at 16:24

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

Reduce((function() {counter = 0
                    function(x, y) {
                      counter <<- counter + 1
                      d = merge(x, y, all = T, by = 'x')
                      setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                    }})(), list(DT1, DT2, DT3, DT4, DT5))
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19

Mayers answered 11/9, 2015 at 16:9 Comment(2)

What's up with the parentheses around and following the function definition, like (function()...)()? – Bema 11/9, 2015 at 16:27

@Bema it's a closure the outer function creates an environment, and returns the inner function, which is what those parentheses extract – Mayers 11/9, 2015 at 16:35

stack and reshape I don't think this maps exactly to the merge function but...

mycols <- "x"
DTlist <- list(DT1,DT2,DT3,DT4,DT5)

dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))

#    x  1  2  3  4  5
# 1: a 10 11 12 13 14
# 2: b 11 12 13 14 15
# 3: c 12 13 14 15 16
# 4: d 13 14 15 16 17
# 5: e 14 15 16 17 18
# 6: f 15 16 17 18 19

I have no sense for if this would extend to having more columns than y.

merge-assign

DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
  lapply(DTlist,`[.noquote`,mycols))

for (k in seq_along(DTlist)){
  js = setdiff( names(DTlist[[k]]), mycols )
  DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
}

#    x y.1 y.2 y.3 y.4 y.5
# 1: a  10  11  12  13  14
# 2: b  11  12  13  14  15
# 3: c  12  13  14  15  16
# 4: d  13  14  15  16  17
# 5: e  14  15  16  17  18
# 6: f  15  16  17  18  19

(I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)

Bema answered 11/9, 2015 at 15:57 Comment(0)

Using reshaping gives you a lot more flexibility in how you want to name your columns.

library(dplyr)
library(tidyr)

list(DT1, DT2, DT3, DT4, DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste("y", source, sep = ".")) %>%
  spread(source, y)

Or, this would work

library(dplyr)
library(tidyr)

list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
  bind_rows(.id = "source") %>%
  mutate(source = paste(source, "y", sep = ".")) %>%
  spread(source, y)

Isolated answered 11/9, 2015 at 15:46 Comment(2)

After the bind_rows there is no source column, so I see the Error: cannot coerce type 'closure' to vector of type 'character' (since source is a function). Not sure the fix for this... presumably you're misusing bind_rows...? – Bema 11/9, 2015 at 16:3

The .id feature is new in dplyr 0.4.3. Is that the version you are using? – Isolated 11/9, 2015 at 16:29

Another way of doing this:

dts <- list(DT1, DT2, DT3, DT4, DT5)

names(dts) <- paste("y", seq_along(dts), sep="")
data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")

#   x y1 y2 y3 y4 y5
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19

The package name in "data.table::dcast" is added to ensure that the call returns a data table and not a data frame even if the "reshape2" package is loaded as well. Without mentioning the package name explicitly, the dcast function from the reshape2 package might be used which works on a data.frame and returns a data.frame instead of a data.table.

Terresaterrestrial answered 3/3, 2016 at 10:35 Comment(0)

Alternatively you could setNames for the columns before and do merge like this

dts = list(DT1, DT2, DT3, DT4, DT5)
names(dts) = paste('DT', c(1:5), sep = '')    

dtlist = lapply(names(dts),function(i) 
         setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))

Reduce(function(...) merge(..., all = T), dtlist)

#   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
#1: a    10    11    12    13    14
#2: b    11    12    13    14    15
#3: c    12    13    14    15    16
#4: d    13    14    15    16    17
#5: e    14    15    16    17    18
#6: f    15    16    17    18    19

Shul answered 11/9, 2015 at 16:14 Comment(5)

Fyi, it is not necessary to give names to dts; you can already refer to them by 1:5. Also, you probably want a for loop with setnames, not setNames (mentioned by eddi in a comment on the q and used by the op). – Bema 11/9, 2015 at 16:15

@Bema yes thanks! I just thought to include that step thinking if OP may want to put data.table names instead of just numbers, in that way the ultimate data.table would be more informative right – Shul 11/9, 2015 at 16:18

@Bema I didn't understand why for loop? – Shul 11/9, 2015 at 16:23

The data.table function setnames operates by reference, modifying the object itself, so there is no need to assign the function's value/result. – Bema 11/9, 2015 at 16:24

@Bema Oh, that's when I am using data.table's setnames right.. OK cool! – Shul 11/9, 2015 at 16:28

This is an alternative solution - you can define join columns each time (when your x columns do not have the same values) . You need to define vectors with column names. Then you may chain joining by reference like this:

cols_dt1 <- colnames(dt_dt1)[!colnames(dt_dt1) %in% 'join_column1']
cols_dt2 <- colnames(dt_dt2)[!colnames(dt_dt2) %in% ' join_column2']
cols_dt3 <- colnames(dt_dt3)[!colnames(dt_dt3) %in% ' join_column3']
cols_dt4 <- colnames(dt_dt4)[!colnames(dt_dt4) %in% ' join_column4']
cols_dt5 <- colnames(dt_dt5)[!colnames(dt_dt5) %in% ' join_column5']

data_dt[dt_dt1, on=.( join_column1), (cols_dt1) := mget(cols_dt1)][
  dt_dt2, on=.( join_column2), (cols_dt2) := mget(cols_dt2)][
    dt_dt3, on=.( join_column3), (cols_dt3) := mget(cols_dt3)][
      dt_dt4, on=.( join_column4), (cols_dt4) := mget(cols_dt4)][
        dt_dt5, on=.( join_column5), (cols_dt5) := mget(cols_dt5)]

Kimikokimitri answered 23/1, 2023 at 9:28 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags