R: Remove multiple empty columns of character variables
Asked Answered
F

9

27

I have a data frame where all the variables are of character type. Many of the columns are completely empty, i.e. only the variable headers are there, but no values. Is there any way to subset out the empty columns?

Footcloth answered 16/7, 2013 at 9:21 Comment(1)
Btw, by empty, do you mean NA or ""?Antecedent
M
29

If your empty columns are really empty character columns, something like the following should work. It will need to be modified if your "empty" character columns include, say, spaces.

Sample data:

mydf <- data.frame(
  A = c("a", "b"),
  B = c("y", ""),
  C = c("", ""),
  D = c("", ""),
  E = c("", "z")
)
mydf
#   A B C D E
# 1 a y      
# 2 b       z

Identifying and removing the "empty" columns.

mydf[!sapply(mydf, function(x) all(x == ""))]
#   A B E
# 1 a y  
# 2 b   z

Alternatively, as recommended by @Roland:

> mydf[, colSums(mydf != "") != 0]
  A B E
1 a y  
2 b   z
Mastic answered 16/7, 2013 at 9:25 Comment(6)
I just recommend to extend the condition in case there are NA values instead of "": all(x == "" || is.na(x)). But it will loose a bit of the elegance though :)Icy
mydf[,colSums(mydf!="")!=0] might be faster.Complexion
@Tomas, my guess here was that some data got read in incorrectly, leading to empty character columns, and NA is different than a read error, but good suggestion anyway.Mastic
Hi Ananda... Actually, the reading-in is correct. The external data is in the form of many blocks appended together. I have to parse out the individual blocks and throw out the irrelevant columns.Footcloth
@AnandaMahto Maybe mydf[,colSums(mydf!="")>0] is more to your taste?Complexion
@user702432, Cool. Glad it worked out for you. In the future, please do try to make a reproducible example, as I have done in this answer.Mastic
A
14

If you're talking about columns where all values are NA, use remove_empty("cols") from the janitor package.

If you have character vectors where every value is the empty string "", you can first convert those values to NA throughout your data.frame with na_if from the dplyr package:

dat <- data.frame(
  x = c("a", "b", "c"),
  y = c("", "", ""),
  z = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

dat
#>   x y  z
#> 1 a   NA
#> 2 b   NA
#> 3 c   NA

library(dplyr)
library(janitor)

dat %>%
  mutate_all(funs(na_if(., ""))) %>%
  remove_empty("cols")
#>   x
#> 1 a
#> 2 b
#> 3 c
Allx answered 27/1, 2017 at 13:23 Comment(2)
Neat solution! But depreciated. Package says: This function is deprecated, use remove_empty("cols") instead.Celebrity
@radek good catch, thanks. I updated the answer accordingly.Allx
A
13

You can do either of the following:

emptycols <- sapply(df, function (k) all(is.na(k)))
df <- df[!emptycols]

or:

emptycols <- colSums(is.na(df)) == nrow(df)
df <- df[!emptycols]

If by empty you mean they are "", the second approach can be adapted like so:

emptycols <- colSums(df == "") == nrow(df)
Antecedent answered 16/7, 2013 at 9:26 Comment(3)
This is throwing an error. I think is.na() works only on numeric variables.Footcloth
user702432, the error thrown is not because of what you're saying, is.na will work just fine on character vectors as well. It's because he's passing an anonymous function to another. @asb, I think you can't do that. you'll have to do sapply(df, function(x) all(is.na(x))).Lan
Oversight. Just fixed it.Antecedent
M
7

I have a similar situation -- I'm working with a large public records database but when I whittle it down to just the date range and category that I need, there are a ton of columns that aren't in use. Some are blank and some are NA.

The selected answer: https://mcmap.net/q/496473/-r-remove-multiple-empty-columns-of-character-variables didn't work for me, but this did:

df[!sapply(df, function (x) all(is.na(x) | x == ""))]
Myrica answered 12/5, 2016 at 5:19 Comment(0)
I
4

It depends what you mean by empty: Is it NA or "", or can it even be " "? Something like this might work:

df[,!apply(df, 2, function(x) all(gsub(" ", "", x)=="", na.rm=TRUE))]
Ineradicable answered 16/7, 2013 at 9:35 Comment(0)
B
4

This can also be done by dplyr and select_if

`select_if(df,function(x){any(!is.na(x))})`

or with is.null() or x=="" depending on how empty values are defined in your data.

Buckram answered 11/10, 2019 at 16:49 Comment(0)
H
3

A simple solution using the purrr package:

purrr::discard(my_data_frame, ~all(is.na(.)))

Hosbein answered 30/12, 2019 at 10:51 Comment(0)
P
2

Here is something that can be modified to exclude columns containing any variables specied.

newdf= df[, apply(df, 2, function(x) !any({is.na(x) | x== "" | 
x== "-4"} ) )] 
Pterous answered 28/7, 2018 at 5:44 Comment(0)
I
-1

If you know the column indices, you can use

df[,-c(3, 5, 7)]

This will omit columns 3, 5, 7.

Icy answered 16/7, 2013 at 9:23 Comment(1)
Unfortunately I don't. There are 400+ variables, and the empty columns are all randomly placed.Footcloth

© 2022 - 2024 — McMap. All rights reserved.