Combine two data frames by rows (rbind) when they have different sets of columns
Asked Answered
B

15

326

Is it possible to row bind two data frames that don't have the same set of columns? I am hoping to retain the columns that do not match after the bind.

Buonarroti answered 4/8, 2010 at 3:25 Comment(0)
R
278

rbind.fill from the package plyr might be what you are looking for.

Reggiereggis answered 4/8, 2010 at 4:0 Comment(4)
rbind.fill and bind_rows() both silently drop rownames.Apartheid
@Apartheid Hadley: "Yes, all dplyr methods ignore rownames."Perspicacity
Here is a link to documentation: rdocumentation.org/packages/plyr/versions/1.8.4/topics/…Numbersnumbfish
using do.call(plyr::rbind.fill, myList) saved my day. myList is a list of tablesHasa
S
221

A more recent solution is to use dplyr's bind_rows function which I assume is more efficient than smartbind.

df1 <- data.frame(a = c(1:5), b = c(6:10))
df2 <- data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
dplyr::bind_rows(df1, df2)
    a  b    c
1   1  6 <NA>
2   2  7 <NA>
3   3  8 <NA>
4   4  9 <NA>
5   5 10 <NA>
6  11 16    A
7  12 17    B
8  13 18    C
9  14 19    D
10 15 20    E
Shuma answered 7/1, 2015 at 2:33 Comment(3)
I am trying to combine a large number of dataframes (16) with different column names When I try this I get an error Error: Column ABC can't be converted from character to numeric. Is there a way to convert the columns first?Stellite
@sar: df$column <- as.character(df$column). Also see dplyr.tidyverse.org/reference/mutate_all.htmlPontiac
modern dplyr implementation would be ... %>% mutate(across(c(char_column1, char_column2), ~ as.numeric(.x)) %>% ...Anhedral
M
77

Most of the base R answers address the situation where only one data.frame has additional columns or that the resulting data.frame would have the intersection of the columns. Since the OP writes I am hoping to retain the columns that do not match after the bind, an answer using base R methods to address this issue is probably worth posting.

Below, I present two base R methods: One that alters the original data.frames, and one that doesn't. Additionally, I offer a method that generalizes the non-destructive method to more than two data.frames.

First, let's get some sample data.

# sample data, variable c is in df1, variable d is in df2
df1 = data.frame(a=1:5, b=6:10, d=month.name[1:5])
df2 = data.frame(a=6:10, b=16:20, c = letters[8:12])

Two data.frames, alter originals
In order to retain all columns from both data.frames in an rbind (and allow the function to work without resulting in an error), you add NA columns to each data.frame with the appropriate missing names filled in using setdiff.

# fill in non-overlapping columns with NAs
df1[setdiff(names(df2), names(df1))] <- NA
df2[setdiff(names(df1), names(df2))] <- NA

Now, rbind-em

rbind(df1, df2)
    a  b        d    c
1   1  6  January <NA>
2   2  7 February <NA>
3   3  8    March <NA>
4   4  9    April <NA>
5   5 10      May <NA>
6   6 16     <NA>    h
7   7 17     <NA>    i
8   8 18     <NA>    j
9   9 19     <NA>    k
10 10 20     <NA>    l

Note that the first two lines alter the original data.frames, df1 and df2, adding the full set of columns to both.


Two data.frames, do not alter originals
To leave the original data.frames intact, first loop through the names that differ, return a named vector of NAs that are concatenated into a list with the data.frame using c. Then, data.frame converts the result into an appropriate data.frame for the rbind.

rbind(
  data.frame(c(df1, sapply(setdiff(names(df2), names(df1)), function(x) NA))),
  data.frame(c(df2, sapply(setdiff(names(df1), names(df2)), function(x) NA)))
)

Many data.frames, do not alter originals
In the instance that you have more than two data.frames, you could do the following.

# put data.frames into list (dfs named df1, df2, df3, etc)
mydflist <- mget(ls(pattern="df\\d+"))
# get all variable names
allNms <- unique(unlist(lapply(mydflist, names)))

# put em all together
do.call(rbind,
        lapply(mydflist,
               function(x) data.frame(c(x, sapply(setdiff(allNms, names(x)),
                                                  function(y) NA)))))

Maybe a bit nicer to not see the row names of original data.frames? Then do this.

do.call(rbind,
        c(lapply(mydflist,
                 function(x) data.frame(c(x, sapply(setdiff(allNms, names(x)),
                                                    function(y) NA)))),
          make.row.names=FALSE))
Malleolus answered 8/10, 2017 at 20:16 Comment(5)
I have 16 dataframes some with different columns (approximately 70-90 total columns in each). When I try this, I get stuck with the first command <- mget(ls(pattern="df\\d+")). My dataframes have different names. I tried making a list using mydflist <- c(as,dr,kr, hyt, ed1, of) but this gave me an enormous list.Stellite
Just linking to @GKiStellite
@Stellite use mydflist <- list(as, dr, kr, hyt, ed1, of). This should construct a list object that does not grow the size of your environment, but just points to each element of the list (as long as you do not alter any of the contents afterward). After the operation, remove the list object, just to be safe.Malleolus
Great to have a base R solution, but I've found that do.call() solution (for many dataframes) to be very slow. Any idea what could make it faster?Goncourt
If you have a bunch of data.frames and are really need the speed, you might consider switching to data.table's rbindlist. It is highly optimized.Malleolus
P
53

An alternative with data.table:

library(data.table)
df1 = data.frame(a = c(1:5), b = c(6:10))
df2 = data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
rbindlist(list(df1, df2), fill = TRUE)

rbind will also work in data.table as long as the objects are converted to data.table objects, so

rbind(setDT(df1), setDT(df2), fill=TRUE)

will also work in this situation. This can be preferable when you have a couple of data.tables and don't want to construct a list.

Parachronism answered 22/2, 2016 at 1:51 Comment(1)
This is the most simple, out-of-the-box solution that easily generalizes to any number of dataframes, since you can store them all in separate list elements. Other answers, like the intersect approach, only work for 2 dataframes and don't easily generalize.Quarto
B
52

You can use smartbind from the gtools package.

Example:

library(gtools)
df1 <- data.frame(a = c(1:5), b = c(6:10))
df2 <- data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
smartbind(df1, df2)
# result
     a  b    c
1.1  1  6 <NA>
1.2  2  7 <NA>
1.3  3  8 <NA>
1.4  4  9 <NA>
1.5  5 10 <NA>
2.1 11 16    A
2.2 12 17    B
2.3 13 18    C
2.4 14 19    D
2.5 15 20    E
Belted answered 4/8, 2010 at 3:45 Comment(2)
I tried smartbind with two large data frames (in total roughly 3*10^6 rows) and aborted it after 10 minutes.Aiglet
A lot has happened in 9 years :) I might not use smartbind today. Note also that the original question did not specify large data frames.Belted
A
49

If the columns in df1 is a subset of those in df2 (by column names):

df3 <- rbind(df1, df2[, names(df1)])
Aeniah answered 4/8, 2010 at 4:33 Comment(0)
M
23

You could also just pull out the common column names.

> cols <- intersect(colnames(df1), colnames(df2))
> rbind(df1[,cols], df2[,cols])
Milone answered 4/8, 2010 at 3:50 Comment(0)
T
9

I wrote a function to do this because I like my code to tell me if something is wrong. This function will explicitly tell you which column names don't match and if you have a type mismatch. Then it will do its best to combine the data.frames anyway. The limitation is that you can only combine two data.frames at a time.

### combines data frames (like rbind) but by matching column names
# columns without matches in the other data frame are still combined
# but with NA in the rows corresponding to the data frame without
# the variable
# A warning is issued if there is a type mismatch between columns of
# the same name and an attempt is made to combine the columns
combineByName <- function(A,B) {
    a.names <- names(A)
    b.names <- names(B)
    all.names <- union(a.names,b.names)
    print(paste("Number of columns:",length(all.names)))
    a.type <- NULL
    for (i in 1:ncol(A)) {
        a.type[i] <- typeof(A[,i])
    }
    b.type <- NULL
    for (i in 1:ncol(B)) {
        b.type[i] <- typeof(B[,i])
    }
    a_b.names <- names(A)[!names(A)%in%names(B)]
    b_a.names <- names(B)[!names(B)%in%names(A)]
    if (length(a_b.names)>0 | length(b_a.names)>0){
        print("Columns in data frame A but not in data frame B:")
        print(a_b.names)
        print("Columns in data frame B but not in data frame A:")
        print(b_a.names)
    } else if(a.names==b.names & a.type==b.type){
        C <- rbind(A,B)
        return(C)
    }
    C <- list()
    for(i in 1:length(all.names)) {
        l.a <- all.names[i]%in%a.names
        pos.a <- match(all.names[i],a.names)
        typ.a <- a.type[pos.a]
        l.b <- all.names[i]%in%b.names
        pos.b <- match(all.names[i],b.names)
        typ.b <- b.type[pos.b]
        if(l.a & l.b) {
            if(typ.a==typ.b) {
                vec <- c(A[,pos.a],B[,pos.b])
            } else {
                warning(c("Type mismatch in variable named: ",all.names[i],"\n"))
                vec <- try(c(A[,pos.a],B[,pos.b]))
            }
        } else if (l.a) {
            vec <- c(A[,pos.a],rep(NA,nrow(B)))
        } else {
            vec <- c(rep(NA,nrow(A)),B[,pos.b])
        }
        C[[i]] <- vec
    }
    names(C) <- all.names
    C <- as.data.frame(C)
    return(C)
}
Torosian answered 3/2, 2011 at 5:27 Comment(0)
O
3

gtools/smartbind didnt like working with Dates, probably because it was as.vectoring. So here's my solution...

sbind = function(x, y, fill=NA) {
    sbind.fill = function(d, cols){ 
        for(c in cols)
            d[[c]] = fill
        d
    }

    x = sbind.fill(x, setdiff(names(y),names(x)))
    y = sbind.fill(y, setdiff(names(x),names(y)))

    rbind(x, y)
}
Orford answered 13/11, 2013 at 16:22 Comment(1)
using dplyr::bind_rows(x, y) in place of rbind(x,y) keeps the column order based on the first data frame.Asci
B
2

Just for the documentation. You can try the Stack library and its function Stack in the following form:

Stack(df_1, df_2)

I have also the impression that it is faster than other methods for large data sets.

Bloodandthunder answered 15/8, 2017 at 19:48 Comment(0)
S
1

Maybe I completely misread your question, but the "I am hoping to retain the columns that do not match after the bind" makes me think you are looking for a left join or right join similar to an SQL query. R has the merge function that lets you specify left, right, or inner joins similar to joining tables in SQL.

There is already a great question and answer on this topic here: How to join (merge) data frames (inner, outer, left, right)?

Schumann answered 4/8, 2010 at 13:13 Comment(0)
S
1

You could also use sjmisc::add_rows(), which uses dplyr::bind_rows(), but unlike bind_rows(), add_rows() preserves attributes and hence is useful for labelled data.

See following example with a labelled dataset. The frq()-function prints frequency tables with value labels, if the data is labelled.

library(sjmisc)
library(dplyr)

data(efc)
# select two subsets, with some identical and else different columns
x1 <- efc %>% select(1:5) %>% slice(1:10)
x2 <- efc %>% select(3:7) %>% slice(11:20)

str(x1)
#> 'data.frame':    10 obs. of  5 variables:
#>  $ c12hour : num  16 148 70 168 168 16 161 110 28 40
#>   ..- attr(*, "label")= chr "average number of hours of care per week"
#>  $ e15relat: num  2 2 1 1 2 2 1 4 2 2
#>   ..- attr(*, "label")= chr "relationship to elder"
#>   ..- attr(*, "labels")= Named num  1 2 3 4 5 6 7 8
#>   .. ..- attr(*, "names")= chr  "spouse/partner" "child" "sibling" "daughter or son -in-law" ...
#>  $ e16sex  : num  2 2 2 2 2 2 1 2 2 2
#>   ..- attr(*, "label")= chr "elder's gender"
#>   ..- attr(*, "labels")= Named num  1 2
#>   .. ..- attr(*, "names")= chr  "male" "female"
#>  $ e17age  : num  83 88 82 67 84 85 74 87 79 83
#>   ..- attr(*, "label")= chr "elder' age"
#>  $ e42dep  : num  3 3 3 4 4 4 4 4 4 4
#>   ..- attr(*, "label")= chr "elder's dependency"
#>   ..- attr(*, "labels")= Named num  1 2 3 4
#>   .. ..- attr(*, "names")= chr  "independent" "slightly dependent" "moderately dependent" "severely dependent"

bind_rows(x1, x1) %>% frq(e42dep)
#> 
#> # e42dep <numeric> 
#> # total N=20  valid N=20  mean=3.70  sd=0.47
#>  
#>   val frq raw.prc valid.prc cum.prc
#>     3   6      30        30      30
#>     4  14      70        70     100
#>  <NA>   0       0        NA      NA

add_rows(x1, x1) %>% frq(e42dep)
#> 
#> # elder's dependency (e42dep) <numeric> 
#> # total N=20  valid N=20  mean=3.70  sd=0.47
#>  
#>  val                label frq raw.prc valid.prc cum.prc
#>    1          independent   0       0         0       0
#>    2   slightly dependent   0       0         0       0
#>    3 moderately dependent   6      30        30      30
#>    4   severely dependent  14      70        70     100
#>   NA                   NA   0       0        NA      NA
Sweeper answered 24/9, 2018 at 11:28 Comment(0)
V
0

You can insert them at the end of your original database (db1) adding the number of rows of your second database. The columns that are not included in db2 will show NA values.

db1[nrow(db1)+1:nrow(db1)+nrow(db2), names(db2)] <- db2

Valladares answered 25/5, 2022 at 21:37 Comment(0)
G
0

Unfortunately, the best answer data.table::rbindlist(x, fill=TRUE) didn't work for me. Instead, it corrupted my data, which I found out only during analysis when some rows that had value before merging lost their value after merging.

Other solutions using merge or rbind also didn't work due to a differing number of columns.

So I had to develop my own special solution. Its two short functions with base! Nothing else required.

The first issue is that we can't subset data.frame with non-existing columns. But if we solve that, we can just subset those data.frames and rbind the result.

subset_missing = function(x, select){
    y = lapply(select, \(y) if(y %in% names(x)) x[[y]] else NA)
    data.frame(y) |> setNames(select)
    }

This is a cleaned solution from another StackOverflow answer.

Once we have that, the rest is easy!

merge_df = function(x){
    nm = lapply(x, colnames) |> unlist() |> unique()
    y = lapply(x, subset_missing, select = nm)
    do.call(what = rbind, y)
    }

Now, lets test it:

df1 = data.frame(a = c(1:5), b = c(6:10))
df2 = data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])

merge_df(list(df1, df2))

#     a  b    c
# 1   1  6 <NA>
# 2   2  7 <NA>
# 3   3  8 <NA>
# 4   4  9 <NA>
# 5   5 10 <NA>
# 6  11 16    A
# 7  12 17    B
# 8  13 18    C
# 9  14 19    D
# 10 15 20    E
Gaia answered 1/2 at 3:3 Comment(0)
S
-1
rbind.ordered=function(x,y){

  diffCol = setdiff(colnames(x),colnames(y))
  if (length(diffCol)>0){
    cols=colnames(y)
    for (i in 1:length(diffCol)) y=cbind(y,NA)
    colnames(y)=c(cols,diffCol)
  }

  diffCol = setdiff(colnames(y),colnames(x))
  if (length(diffCol)>0){
    cols=colnames(x)
    for (i in 1:length(diffCol)) x=cbind(x,NA)
    colnames(x)=c(cols,diffCol)
  }
  return(rbind(x, y[, colnames(x)]))
}
Storebought answered 24/7, 2012 at 11:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.