How to drop columns by name in a data frame
Asked Answered
R

12

376

I have a large data set and I would like to read specific columns or drop all the others.

data <- read.dta("file.dta")

I select the columns that I'm not interested in:

var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]

and than I'd like to do something like:

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

to drop all the unwanted columns. Is this the optimal solution?

Ropeway answered 8/3, 2011 at 14:56 Comment(2)
sleeping over the problem, i was thinking that subset(data, select=c(...)) helps in my case for dropping vars. the question though was mainly about the paste("data$",var.out[i],sep="") part to access columns of interest inside the loop. how can i paste or somehow compose a column name? Thanks to everyone for your attention and your helpRopeway
Possible duplicate of Drop columns in R data frameVtarj
L
477

You should use either indexing or the subset function. For example :

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8

Then you can use the which function and the - operator in column indexation :

R> df[ , -which(names(df) %in% c("z","u"))]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Or, much simpler, use the select argument of the subset function : you can then use the - operator directly on a vector of column names, and you can even omit the quotes around the names !

R> subset(df, select=-c(z,u))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Note that you can also select the columns you want instead of dropping the others :

R> df[ , c("x","y")]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

R> subset(df, select=c(x,y))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
Lineage answered 8/3, 2011 at 15:3 Comment(9)
the select argument of the subset function did the job perfectly! Thank you juba!Ropeway
which is not neccessary, see Ista's answer. But the subset with - is nice! Didn't know that!Duhl
subset looks good, but the way it silently drops missing values seems pretty dangerous to me.Soissons
subset is indeed very convenient, but remember to avoid using it unless you're using R interactively. See the Warning in the function's documentation and this SO question for more.Glanders
If there are several duplicate colnames in your data, the which method works fine, while others would only output the only first col.Recoverable
What could cause this > subset(testData5, select= -c("myCol")) Error in -c("myCol") : invalid argument to unary operator?Knee
"you can even omit the quotes around the names !", you actually have to omit the quotes, otherwise you'll get invalid argument to unary operator. If you have certain characters (for example "-") in your names you cannot use this method at all since dropping quotes will cause R to be unable to properly parse your code.Adenoidal
.@Lineage - What if I want to drop column where the column name starts with -?Niemann
@ChetanArvindPatil - backtick quotes does it - dat <- data.frame("-ab"=1:3, "other"=2:4, check.names=FALSE); subset(dat, select=-c(`-ab`))Triploid
S
150

Do not use -which() for this, it is extremely dangerous. Consider:

dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
dat[ , -which(names(dat) %in% c("z","u"))] ## works as expected
dat[ , -which(names(dat) %in% c("foo","bar"))] ## deletes all columns! Probably not what you wanted...

Instead use subset or the ! function:

dat[ , !names(dat) %in% c("z","u")] ## works as expected
dat[ , !names(dat) %in% c("foo","bar")] ## returns the un-altered data.frame. Probably what you want

I have learned this from painful experience. Do not overuse which()!

Sputter answered 8/3, 2011 at 18:12 Comment(1)
setdiff is also useful: setdiff(names(dat), c("foo", "bar"))Saffier
B
57

First, you can use direct indexing (with booleans vectors) instead of re-accessing column names if you are working with the same data frame; it will be safer as pointed out by Ista, and quicker to write and to execute. So what you will only need is:

var.out.bool <- !names(data) %in% c("iden", "name", "x_serv", "m_serv")

and then, simply reassign data:

data <- data[,var.out.bool] # or...
data <- data[,var.out.bool, drop = FALSE] # You will need this option to avoid the conversion to an atomic vector if there is only one column left

Second, quicker to write, you can directly assign NULL to the columns you want to remove:

data[c("iden", "name", "x_serv", "m_serv")] <- list(NULL) # You need list() to respect the target structure.

Finally, you can use subset(), but it cannot really be used in the code (even the help file warns about it). Specifically, a problem to me is that if you want to directly use the drop feature of susbset() you need to write without quotes the expression corresponding to the column names:

subset( data, select = -c("iden", "name", "x_serv", "m_serv") ) # WILL NOT WORK
subset( data, select = -c(iden, name, x_serv, m_serv) ) # WILL

As a bonus, here is small benchmark of the different options, that clearly shows that subset is the slower, and that the first, reassigning method is the faster:

                                        re_assign(dtest, drop_vec)  46.719  52.5655  54.6460  59.0400  1347.331
                                      null_assign(dtest, drop_vec)  74.593  83.0585  86.2025  94.0035  1476.150
               subset(dtest, select = !names(dtest) %in% drop_vec) 106.280 115.4810 120.3435 131.4665 65133.780
 subset(dtest, select = names(dtest)[!names(dtest) %in% drop_vec]) 108.611 119.4830 124.0865 135.4270  1599.577
                                  subset(dtest, select = -c(x, y)) 102.026 111.2680 115.7035 126.2320  1484.174

Microbench graph

Code is below :

dtest <- data.frame(x=1:5, y=2:6, z = 3:7)
drop_vec <- c("x", "y")

null_assign <- function(df, names) {
  df[names] <- list(NULL)
  df
}

re_assign <- function(df, drop) {
  df <- df [, ! names(df) %in% drop, drop = FALSE]
  df
}

res <- microbenchmark(
  re_assign(dtest,drop_vec),
  null_assign(dtest,drop_vec),
  subset(dtest, select = ! names(dtest) %in% drop_vec),
  subset(dtest, select = names(dtest)[! names(dtest) %in% drop_vec]),
  subset(dtest, select = -c(x, y) ),
times=5000)

plt <- ggplot2::qplot(y=time, data=res[res$time < 1000000,], colour=expr)
plt <- plt + ggplot2::scale_y_log10() + 
  ggplot2::labs(colour = "expression") + 
  ggplot2::scale_color_discrete(labels = c("re_assign", "null_assign", "subset_bool", "subset_names", "subset_drop")) +
  ggplot2::theme_bw(base_size=16)
print(plt)
Bagatelle answered 22/7, 2013 at 20:6 Comment(3)
I like your second alternative using NULL, but why when you put more than two names is necessary to assign it with list(NULL) ? I'm only curious to know how it works, because I tried with only one name and I don't need list()Galvani
@DarwinPC Yes. If you access directly one vector element (with $ or [[), using <- list(NULL) will actually lead to wrong results. If you access a subset of the dataframe with one or multiple columns, <- list(NULL) is the way to go, even if it is not needed for a one column dataframe (because df['myColumns'] will get casted to a vector if needed).Thanasi
Has this behaviour changed? I get identical results with both NULL and list(NULL).Aquarius
G
34

You can also try the dplyr package:

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
R> library(dplyr)
R> dplyr::select(df2, -c(x, y))  # remove columns x and y
  z u
1 3 4
2 4 5
3 5 6
4 6 7
5 7 8
Gony answered 12/6, 2015 at 18:15 Comment(2)
Using dplyr::select(df2, -one_of(c('x','y'))) will still work (with a warning) even if some of the named columns don't existAttwood
This was exactly the solution I was searching for @divibisan, thanks!Campball
W
19

Here's a quick solution for this. Say, you have a data frame X with three columns A, B and C:

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6))
> X
  A B C
1 1 3 5
2 2 4 6

If I want to remove a column, say B, just use grep on colnames to get the column index, which you can then use to omit the column.

> X<-X[,-grep("B",colnames(X))]

Your new X data frame would look like the following (this time without the B column):

> X
  A C
1 1 5
2 2 6

The beauty of grep is that you can specify multiple columns that match the regular expression. If I had X with five columns (A,B,C,D,E):

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
  A B C D  E
1 1 3 5 7  9
2 2 4 6 8 10

Take out columns B and D:

> X<-X[,-grep("B|D",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10

EDIT: Considering the grepl suggestion of Matthew Lundberg in the comments below:

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
  A B C D  E
1 1 3 5 7  9
2 2 4 6 8 10
> X<-X[,!grepl("B|D",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10

If I try to drop a column that's non-existent,nothing should happen:

> X<-X[,!grepl("G",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10
Wie answered 25/12, 2015 at 1:10 Comment(1)
X[,-grep("B",colnames(X))] will return no columns in the case where no column name contains B, rather than returning all the columns as would be desired. Consider with X <- iris for an example. This is the problem with using negative indices with computed values. Consider grepl instead.Imperforate
K
11
df2 <- df[!names(df) %in% c("c1", "c2")]
Kirshbaum answered 3/12, 2015 at 5:59 Comment(1)
Short and slick, love it!Revoice
B
7

I tried to delete a column while using the package data.table and got an unexpected result. I kind of think the following might be worth posting. Just a little cautionary note.

[ Edited by Matthew ... ]

DF = read.table(text = "
     fruit state grade y1980 y1990 y2000
     apples Ohio   aa    500   100   55
     apples Ohio   bb      0     0   44
     apples Ohio   cc    700     0   33
     apples Ohio   dd    300    50   66
", sep = "", header = TRUE, stringsAsFactors = FALSE)

DF[ , !names(DF) %in% c("grade")]   # all columns other than 'grade'
   fruit state y1980 y1990 y2000
1 apples  Ohio   500   100    55
2 apples  Ohio     0     0    44
3 apples  Ohio   700     0    33
4 apples  Ohio   300    50    66

library('data.table')
DT = as.data.table(DF)

DT[ , !names(dat4) %in% c("grade")]    # not expected !! not the same as DF !!
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

DT[ , !names(DT) %in% c("grade"), with=FALSE]    # that's better
    fruit state y1980 y1990 y2000
1: apples  Ohio   500   100    55
2: apples  Ohio     0     0    44
3: apples  Ohio   700     0    33
4: apples  Ohio   300    50    66

Basically, the syntax for data.table is NOT exactly the same as data.frame. There are in fact lots of differences, see FAQ 1.1 and FAQ 2.17. You have been warned!

Baez answered 4/1, 2013 at 1:2 Comment(2)
Or you could use DT[,var.out := NULL] to delete the columns you wish to do so.Mastitis
The subset(x, select=...) method works for both data.frame and data.table classesRutharuthann
R
6

I changed the code to:

# read data
dat<-read.dta("file.dta")

# vars to delete
var.in<-c("iden", "name", "x_serv", "m_serv")

# what I'm keeping
var.out<-setdiff(names(dat),var.in)

# keep only the ones I want       
dat <- dat[var.out]

Anyway, juba's answer is the best solution to my problem!

Ropeway answered 9/3, 2011 at 9:24 Comment(2)
Why do you want to do this in a loop? The answers juba's answer shows you how to do it in one step. Why make it more complicated?Sputter
of course i use the select argument of the subset function in my code. i just wanted to see how i could access arbitrary columns in a loop in case i wanted to do something else than just dropping the column. the original data set has about 1200 vars and i'm only interested in using 4 of them without knowing where exactly they are.Ropeway
E
4

If you know exactly the names of the columns in original dataframe called "df":

cols_to_drop <- c("A", "B", "C")
df_clean = df[,!(names(df) %in% cols_to_drop)]

Src: https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html

Eximious answered 11/11, 2021 at 15:29 Comment(0)
B
2

Here is another solution that may be helpful to others. The code below selects a small number of rows and columns from a large data set. The columns are selected as in one of juba's answers except that I use a paste function to select a set of columns with names that are numbered sequentially:

df = read.table(text = "

state county city  region  mmatrix  X1 X2 X3    A1     A2     A3      B1     B2     B3      C1      C2      C3

  1      1     1      1     111010   1  0  0     2     20    200       4      8     12      NA      NA      NA
  1      2     1      1     111010   1  0  0     4     NA    400       5      9     NA      NA      NA      NA
  1      1     2      1     111010   1  0  0     6     60     NA      NA     10     14      NA      NA      NA
  1      2     2      1     111010   1  0  0    NA     80    800       7     11     15      NA      NA      NA

  1      1     3      2     111010   0  1  0     1      2      1       2      2      2      10      20      30
  1      2     3      2     111010   0  1  0     2     NA      1       2      2     NA      40      50      NA
  1      1     4      2     111010   0  1  0     1      1     NA      NA      2      2      70      80      90
  1      2     4      2     111010   0  1  0    NA      2      1       2      2     10     100     110     120

  1      1     1      3     010010   0  0  1    10     20     10     200    200    200       1       2       3
  1      2     1      3     001000   0  0  1    20     NA     10     200    200    200       4       5       9
  1      1     2      3     101000   0  0  1    10     10     NA     200    200    200       7       8      NA
  1      2     2      3     011010   0  0  1    NA     20     10     200    200    200      10      11      12

", sep = "", header = TRUE, stringsAsFactors = FALSE)
df

df2 <- df[df$region == 2, names(df) %in% c(paste("C", seq_along(1:3), sep=''))]
df2

#    C1  C2  C3
# 5  10  20  30
# 6  40  50  NA
# 7  70  80  90
# 8 100 110 120
Baez answered 12/2, 2013 at 23:50 Comment(0)
M
-1

I can´t answer your question in the comments due to low reputation score.

The next code will give you an error because the paste function return a character string

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

Here is a possible solution:

for(i in 1:length(var.out)) {

  text_to_source <- paste0 ("data$", var.out[i], "<- NULL") # Write a line of your
                                                  # code like a character string
  eval (parse (text=text_to_source)) # Source a text that contains a code
}

or just do:

for(i in 1:length(var.out)) {
  data[var.out[i]] <- NULL
}
Merit answered 7/4, 2014 at 13:9 Comment(0)
A
-6
df = mtcars 
remove vs and am because they are categorical. In the dataset vs is in column number 8, am is in column number 9

dfnum = df[,-c(8,9)]

Artful answered 14/6, 2020 at 13:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.