R dplyr operate on a column known only by its string name
Asked Answered
S

2

3

I am wrestling with programming using dplyr in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr to support quosures and the like and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.

My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.

For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.

Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.

userInput_prob1 <- 0.95
userInput_prob2 <- 0.9

# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){

  # Define new column name to hold the MPG percentile cutoff.
  probColName <- paste0('P', pctCutoff*100)

  # Compute the MPG percentile cutoff by number of gears.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by( gear ) %>%
    dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

  # Filter mtcars with only MPG values above cutoffs.
  output <- mtcars %>%
    dplyr::left_join( MPGCutoffs, by='gear' ) %>%
    dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck

  # Return filtered data.
  return(output)
}

best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )

The dplyr::filter() statement is what I can't get to run properly. I've tried:

dplyr::filter( mpg > probColName ) - No error, but no rows returned.

dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.

I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.

Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.

Sfumato answered 19/9, 2017 at 21:39 Comment(0)
P
10

If you have a column name in a string (aka character vector) and you want to use it with tidyeval, then you can covert it with rlang::sym(). Just change

dplyr::filter( mpg > !!rlang::sym(probColName) )

and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116

It's still fine to use

dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

because when dynamically setting a parameter name, you just need the string and not an unqouted symbol.

Pentatomic answered 19/9, 2017 at 21:44 Comment(6)
This is exactly what I was looking to do, thanks @MrFlick. Question: when I evaluate str(rlang::sym(probColName)) it returns symbol 95 (or other number according to the value of pctCutoff. What does the !! operator do in the case of a symbol? Is it converting it to a bare unquoted variable name?Sfumato
A symbol is like a quosure without the environment. The !! Basically teats both symbols and quosures the same. A symbol is a bare unquoted variable name. That's what one used before tidyeval changed the paradigm.Pentatomic
It is interesting to me that the unquote operator !! is still required on a symbol, if a symbol is just a bare unquoted variable name, which is what dplyr verbs expect using NSE. Anyway, this solves a problem I've had for a while now, and with the updated quosure handling with the latest release of dplyr, will enable much more streamlined handling of generic arguments to functions that employ dplyr verbs. Thanks for the insight!Sfumato
It really has more to do with when you want the symbol to be evaluated and when you want it unevaluated. The !! gives you more control over that.Pentatomic
Final follow-up: would there be any difference in using as.name() vs. rlang::sym() to accomplish this? I'm wondering if there may be a nuance related to the environment in which the expression is evaluated; I confess I still don't have my head wrapped around how environments come into play with quosures.Sfumato
If you check the ?sym help page, you can see there is a difference in how they handle encoding, but otherwise they are basically the same.Pentatomic
S
3

Here's an alternate solution from Hadley's comment in the post referred to in MrFlick's answer (https://github.com/tidyverse/rlang/issues/116). Using as.name() from base R takes the place of rlang::sym(), and you still do need to unquote it. That is, the following also works:

dplyr::filter( mpg > !!as.name(probColName) )

Sfumato answered 19/9, 2017 at 22:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.