Conditionally selecting columns in dplyr where a certain proportion of values is NA

Data

I'm working with a data set resembling the data.frame generated below:

set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA

Several of the columns contain NA values, with the last column (valueD) having NAs in more than 60% of its observations (14 of 20):

> sapply(dta, function(x) {table(is.na(x))})
$observation

FALSE 
   20 

$valueA

FALSE 
   20 

$valueB

FALSE  TRUE 
   16     4 

$valueC

FALSE  TRUE 
   11     9 

$valueD

FALSE  TRUE 
    6    14 
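
The same proportions can be read off directly with colMeans() applied to the logical matrix that is.na() returns:

> colMeans(is.na(dta))
observation      valueA      valueB      valueC      valueD 
       0.00        0.00        0.20        0.45        0.70 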

Problem

I would like to be able to remove such columns in a dplyr pipeline, somehow passing the condition to select.

Attempts

This can be done easily in base R. For example, to select columns with less than 50% NAs I can do:

dta[, colSums(is.na(dta)) < nrow(dta) / 2]

which produces:

> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA

Task

I'm interested in achieving the same flexibility in a dplyr pipeline:

Vectorize(require)(package = c("dplyr",         # Data manipulation
                               "magrittr"),     # Reverse pipe
                   char = TRUE)

dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>% 
  # I want my select to take place here
Chowder answered 18/1, 2016 at 10:12 Comment(6)
You can use Filter, i.e. Filter(function(x) sum(is.na(x)) < length(x)/2, dta)Bezonian
@Bezonian As always, thanks for the helpful contribution. I was just wondering, isn't the filter supposed to be dropping observations? I'm interested in removing columns, not rows.Chowder
It is removing the columns, i.e. Filter with a capital F.Bezonian
@Bezonian Now, I got you ?Filter != ?filter :)Chowder
@Bezonian One more question with respect to the suggested Filter solution: I see that you are passing the dta object, but on my real data I'm applying some transformations (like gather and spread), so in effect the object I'm working on does not correspond to the initial dta frame. This is why I added mutate_each(funs(as.numeric)) %>% in my example, to indicate that I'm working on a transformed dta. In effect, I don't really have dta to pass on, just a transformed data.frame after applying a couple of pipes.Chowder
I added a solution with summarise_each. Perhaps it helps you.Bezonian

Like this perhaps?

dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA

Updated to use colMeans instead of colSums, which means you don't need to divide by the number of rows anymore.
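
To spell out the equivalence on the question's dta:

colSums(is.na(dta)) / nrow(dta)  # proportion of NAs per column, two steps
colMeans(is.na(dta))             # the same proportions in one step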

And, just for the record, in base R you could also use colMeans:

dta[,colMeans(is.na(dta)) < 0.5]
Discriminant answered 18/1, 2016 at 10:19 Comment(0)

An update for 2020 perhaps, now that dplyr has reached 1.0.0, which incorporates where():

dta %>% select(where(function(x) sum(is.na(x)) / length(x) < 0.5))
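
where() also accepts a purrr-style lambda, and from R 4.1 the native pipe can be used as well; two equivalent spellings of the same condition (a sketch, not part of the original answer):

dta %>% select(where(~ mean(is.na(.x)) < 0.5))
dta |> select(where(\(x) mean(is.na(x)) < 0.5))  # base pipe, R >= 4.1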
Sweat answered 21/1, 2021 at 21:44 Comment(0)

I think this does the job:

dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head() 


  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA


Stumpf answered 12/6, 2018 at 10:58 Comment(0)

We can use extract from magrittr after getting a logical vector with summarise_each/unlist:

library(magrittr)
library(dplyr)
dta %>% 
    summarise_each(funs(sum(is.na(.)) < n()/2)) %>% 
    unlist() %>%
    extract(dta, .)
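
Note that summarise_each() has since been deprecated; a sketch of the same logical-vector idea with across() (assumes dplyr >= 1.0.0, otherwise unchanged):

dta %>% 
    summarise(across(everything(), ~ sum(is.na(.x)) < n()/2)) %>% 
    unlist() %>%
    extract(dta, .)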

Or use Filter from base R

Filter(function(x) sum(is.na(x)) < length(x)/2, dta)

Or a slightly compact option is

Filter(function(x) mean(is.na(x)) < 0.5, dta)
Bezonian answered 18/1, 2016 at 10:20 Comment(8)
I kind of fail to understand this whole question. Seems like dta[colMeans(is.na(dta)) < .5] is the most concise/vectorized/simple/readable solution; what's the point of all this Filter or dplyr/magrittr stuff? I'm probably getting old.Straighten
@DavidArenburg You do understand that colMeans, colSums etc. convert the data to a matrix. It may not be that efficient (maybe I am wrong). But I like Filter because it adds an exotic touch (just like filter in this question). Other than that, the magrittr stuff is only for dplyr/magrittr fans (my humble opinion).Bezonian
Actually, is.na already converts to a matrix, and the col* functions operate on that matrix. Thus we could do dta[colMeans(sapply(dta, is.na)) < .5] if you like, and it's still very readable. And in that linked question, filter is actually the most concise/vectorized/simple solution, rather than an exotic one. But I guess you are not the addressee here for my grievance.Straighten
@DavidArenburg The question is about delivering the transformation within the dplyr pipeline. I reckon the base solution may prove more efficient in this case, but my ambition was to undertake this transformation within the dplyr pipeline. This is mostly motivated by a personal preference for undertaking this transformation alongside the others within the dplyr scope, which results in a code presentation and R script that suit me better. I also like @akrun's idea of adding an exotic touch to the solution.Chowder
@DavidArenburg, the dplyr solution dta %>% select(which(colMeans(is.na(.)) < 0.5)) is the same as what you describe as most vectorized/concise/... approach except for a call to which - are you really worried about that? To me, it's perfectly understandable that OP is asking specifically for a "piped" answer since this fits their workflow. It's not about seeking the maximum performance. And even if they were, I reckon it's rather unlikely that this column-subsetting operation would be an actual performance bottleneck in real world programs.Discriminant
@docendodiscimus Exactly - it's the same solution, just with unnecessary/irrelevant calls to additional packages/functions (IMO). That reminds me of this. But, as I said - I guess I'm just getting old.Straighten
@DavidArenburg, you still don't understand it (or maybe you don't want to?). It's about workflow, and they're specifically using dplyr anyway. In that sense, using base for this subset would be an interruption.Discriminant
@DavidArenburg I can also add that this transformation is part of a much longer workflow, including charting at the end. In effect those transformations are only applied for the purpose of generating a specific chart, so leaving aside the performance issue, it makes sense for the code to read more or less along the lines of take master data -> do a few things on it -> generate chart. It simply makes a nicer, I would say more concise, read. As @docendodiscimus rightfully pointed out, it's mostly about the workflow more than anything else. In my defence, I would say that to an extent it's a question of taste.Chowder

With purrr::keep

library(dplyr); library(purrr)
dta %>% keep(~ sum(is.na(.))/length(.) < 0.5)
   observation     valueA     valueB     valueC
1            1 0.26550866 0.93470523 0.82094629
2            2 0.37212390         NA         NA
3            3 0.57285336         NA         NA
4            4 0.90820779         NA         NA
5            5 0.20168193         NA         NA
6            6 0.89838968 0.38611409         NA
7            7 0.94467527 0.01339033         NA
8            8 0.66079779 0.38238796         NA
9            9 0.62911404 0.86969085         NA
10          10 0.06178627 0.34034900         NA
11          11 0.20597457 0.48208012 0.47761962
12          12 0.17655675 0.59956583 0.86120948
13          13 0.68702285 0.49354131 0.43809711
14          14 0.38410372 0.18621760 0.24479728
15          15 0.76984142 0.82737332 0.07067905
16          16 0.49769924 0.66846674 0.09946616
17          17 0.71761851 0.79423986 0.31627171
18          18 0.99190609 0.10794363 0.51863426
19          19 0.38003518 0.72371095 0.66200508
20          20 0.77744522 0.41127443 0.40683019
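
For completeness, the predicate can use mean() directly, and purrr::discard() with the complementary condition should keep the same columns (a sketch, not part of the original answer):

dta %>% keep(~ mean(is.na(.)) < 0.5)
dta %>% discard(~ mean(is.na(.)) >= 0.5)  # same result via the complement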
Jone answered 26/10, 2022 at 6:56 Comment(0)
