How can I compare column names of two separate data frames in R?

I have two data frames in R containing epigenetic data. To use one of them as a training set and the other as a test set with the glmnet package, their number of columns has to match. As both data frames contain more than 800,000 columns, I'm looking for a way to compare the column names of the two data frames so that I can delete the columns they don't have in common. So far I have only found packages and functions that compare the rows of two data frames with each other. As an example, I'm looking for something like this:

df1
participant_code cg123  cg122  cg121  cg120

df2
participant_code cg123  cg122  cg121  cg119

The function would then give me, e.g., a table showing which column names differ:

colname 5 differs
Distillery answered 5/11, 2020 at 13:23 Comment(0)
5

You are looking for the intersection of the column names of the two data frames. You can simply use the function intersect to achieve what you want. First, extract the column names of both data frames. Then call intersect on them. The result of intersect contains the column names that are present in both data frames. Use this object to subset the initial data frames and you're done.

# define data frames with dummy data
df1 <- data.frame(participant_code = 1,
                  cg123            = 2,
                  cg122            = 3, 
                  cg121            = 4,
                  cg120            = 5)

df2 <- data.frame(participant_code = 6,
                  cg123            = 7,
                  cg122            = 8, 
                  cg121            = 9,
                  cg119            = 10)

# extract column names of the data frames
cols_df_1 <- names(df1)
cols_df_2 <- names(df2)

# find the intersection of both column name vectors
cols_intersection <- intersect(cols_df_1, cols_df_2)

# subset the initial data frames
df1_sub <- df1[,cols_intersection]
df2_sub <- df2[,cols_intersection]

# print to console and see result
df1_sub
#participant_code cg123 cg122 cg121
#               1     2     3     4

df2_sub
#participant_code cg123 cg122 cg121
#               6     7     8     9
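
Since the question also asks which column names differ, a minimal sketch using base R's setdiff may help. It assumes the same dummy data frames as above; the names only_in_df1, only_in_df2 and differing_positions are made up for this illustration.

```r
# Dummy data frames mirroring the example above
df1 <- data.frame(participant_code = 1, cg123 = 2, cg122 = 3, cg121 = 4, cg120 = 5)
df2 <- data.frame(participant_code = 6, cg123 = 7, cg122 = 8, cg121 = 9, cg119 = 10)

# Column names present in one data frame but missing from the other
only_in_df1 <- setdiff(names(df1), names(df2))
only_in_df2 <- setdiff(names(df2), names(df1))

only_in_df1
# [1] "cg120"
only_in_df2
# [1] "cg119"

# Position-wise comparison; only meaningful when both data frames
# have the same number of columns
differing_positions <- which(names(df1) != names(df2))
differing_positions
# [1] 5
```

setdiff is not symmetric, so it has to be called in both directions to catch columns that are missing on either side.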
Corposant answered 5/11, 2020 at 13:27 Comment(1)
The code as I wrote it in the answer did not produce an error. You might have passed the data frames themselves to the intersect function. It does, however, only work with vectors, and the vectors should contain the column names of the data frames. I updated the answer to make it clearer and even introduced dummy data frames. (Corposant)
2

You can use intersect to get the columns that both data frames have in common.

get_common_cols <- function(df1, df2)  intersect(names(df1), names(df2))

You can pass both data frames to this function to get the common columns and use the result to subset the data frames:

common_cols <- get_common_cols(data1, data2)
data1 <- data1[, common_cols]
data2 <- data2[, common_cols]
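
As a self-contained sketch of the same idea, the following uses made-up dummy data (data1, data2 and their values are assumptions for illustration only). It adds drop = FALSE so the result stays a data frame even if only one column is shared, and checks that both subsets end up with identical column names in the same order.

```r
get_common_cols <- function(df1, df2) intersect(names(df1), names(df2))

# Hypothetical dummy data; note the different column order in data2
data1 <- data.frame(participant_code = 1:2, cg123 = c(0.1, 0.2), cg120 = c(0.5, 0.6))
data2 <- data.frame(cg123 = c(0.3, 0.4), participant_code = 3:4, cg119 = c(0.7, 0.8))

common_cols <- get_common_cols(data1, data2)

# drop = FALSE keeps a data frame even when a single column is selected
data1_sub <- data1[, common_cols, drop = FALSE]
data2_sub <- data2[, common_cols, drop = FALSE]

# Both subsets now share the same columns in the same order
stopifnot(identical(names(data1_sub), names(data2_sub)))
```

Because both subsets are indexed with the same vector, the column order follows the first data frame passed to intersect, which is convenient when the train and test sets must line up column by column.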
Bangor answered 5/11, 2020 at 13:28 Comment(0)
2

This might not work well for a huge data frame, but I have recently become a fan of compare() from the new waldo package.

It prints the differences between the two inputs. Again, the output might be indecipherable for vectors of length 800k, but I thought it was worth pointing out.

library(waldo)

compare(names(df1), names(df2))
Giess answered 5/11, 2020 at 13:36 Comment(0)
0

You could try using the inspectdf package. There is also comparedf in the arsenal package.

Swim answered 9/7 at 0:33 Comment(0)