Why is enquo + !! preferable to substitute + eval?

In the following example, why should we favour using f1 over f2? Is it more efficient in some sense? For someone used to base R, it seems more natural to use the "substitute + eval" option.

library(dplyr)

d = data.frame(x = 1:5,
               y = rnorm(5))

# using enquo + !!
f1 = function(mydata, myvar) {
  m = enquo(myvar)
  mydata %>%
    mutate(two_y = 2 * !!m)
}

# using substitute + eval    
f2 = function(mydata, myvar) {
  m = substitute(myvar)
  mydata %>%
    mutate(two_y = 2 * eval(m))
}

all.equal(d %>% f1(y), d %>% f2(y)) # TRUE

In other words, and beyond this particular example, my question is: can I get away with programming dplyr's NSE functions with good ol' base R like substitute+eval, or do I really need to learn to love all those rlang functions because there is a benefit to it (speed, clarity, compositionality, ...)?

Efthim answered 6/4, 2018 at 21:0 Comment(13)
I think the world would be a better place if the dplyr:: ppl would just allow us to pass variable names as character strings, as in the old underscored variants like mutate_(). imo, an even better option would be to have an argument like colnames_as_strings=TRUE for mutate() et al... that would make it straightforward to use dplyr both interactively and in software. But until then, welcome to enquo()/!! hell...Endosmosis
tl;dr: the enquo() strategy really only makes sense if you are deeply committed to being able to pass column names without quotes (unclear to me why that's important but oh well). could be that there's some fundamental reason that requires understanding dplyr's internals to grasp...Endosmosis
@Endosmosis I’ve been told that passing column names as characters is “dangerous and unreliable”, but I’ve never gotten a convincing explanation for why that is except in cases that seem bizarrely rare to me. I suppose if you encounter those edge cases routinely it makes more sense, it’s just weird to me bc I don’t think I ever have.Vento
@Vento yeah i can imagine if one is mixing standard and non-standard evaluation there could be problems -- but ya totally agreed, i remain unconvinced re. the "dangerous and unreliable" bit (in fact i'd say that passing names without quotes is more dangerous + unreliable, as with base::subset()!)Endosmosis
@Endosmosis No that's shit. It doesn't actually solve anything, or make anything easier. Also, look up "stringly typed". You're suggesting to subvert the type system. That's a priori a bad idea.Chief
@KonradRudolph i'm suggesting to allow character-based selection/subsetting in a language whose definition uses that convention...Endosmosis
@KonradRudolph The only thing I feel knowledgeable enough to comment on at this point is that your case maybe isn’t helped by that first sentence.Vento
@Endosmosis You're suggesting to allow strings instead of variables inside expressions (or expressions inside strings? That's even worse). That's an important difference. Nobody is talking about merely selecting columns.Chief
okay one last thought: motivation comes from the inability to pass a character vector to group_by(), select(), and mutate_at()/summarize_at(). When colnames aren't (or can't be) known in advance, it can be a pain to write good split-apply-combine functions in dplyr. Sometimes it even feels easier to use base::tapply(), precisely because you can specify grouping cols as character strings that you pass as a parameter... In the specific case OP showed, it would of course be terrible if "m" meant mydata$m (or wherever a colname is used on the rhs of = inside a dplyr table func).Endosmosis
(fwiw i love dplyr:: and use it every day -- i just want it to be the best it can be!)Endosmosis
@Endosmosis No, that’s no problem at all. Just use group_by(data, !! var). I honestly fail to see the difficulty. It’s a simple, clean, consistent, yet powerful abstraction. It’s thus diametrically opposite to what tapply etc offer.Chief
@Vento Annoyance got the better of me. But your comment illustrates a permanent problem in this debate: people are paying exclusive attention to tone, rather than contents. Facts don’t seem to matter. I might try to use different words but it wouldn’t change anything: a comment with a technically bad (tried, tested, and found wanting) solution got lots of upvotes. My comment which, besides foul language, offered pointers and factual arguments against it, was disregarded.Chief
@KonradRudolph fwiw I believe you (if for no other reason than I know you know a lot more about this than me). I was merely trying to nudge the tone in a different direction.Vento

I want to give an answer that is independent of dplyr, because there is a very clear advantage to using enquo over substitute. Both look in the calling environment of a function to identify the expression that was given to that function. The difference is that substitute() does it only once, while !!enquo() will correctly walk up the entire calling stack.

Consider a simple function that uses substitute():

f <- function( myExpr ) {
  eval( substitute(myExpr), list(a=2, b=3) )
}

f(a+b)   # 5
f(a*b)   # 6

This functionality breaks when the call is nested inside another function:

g <- function( myExpr ) {
  val <- f( substitute(myExpr) )
  ## Do some stuff
  val
}

g(a+b)
# myExpr     <-- OOPS

Now consider the same functions re-written using enquo():

library( rlang )

f2 <- function( myExpr ) {
  eval_tidy( enquo(myExpr), list(a=2, b=3) )
}

g2 <- function( myExpr ) {
  val <- f2( !!enquo(myExpr) )
  val
}

g2( a+b )    # 5
g2( b/a )    # 1.5
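
The quosure keeps a reference to its environment, so the pattern survives additional layers of nesting; that is what "walking up the entire calling stack" buys you. A quick sketch (h2 is just one more wrapper around g2, added here for illustration):

h2 <- function( myExpr ) {
  g2( !!enquo(myExpr) )
}

h2( a+b )    # 5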

And that is why enquo() + !! is preferable to substitute() + eval(). dplyr simply takes full advantage of this property to build a coherent set of NSE functions.

UPDATE: rlang 0.4.0 introduced a new operator {{ (pronounced "curly curly"), which is effectively shorthand for !!enquo(). This allows us to simplify the definition of g2 to

g2 <- function( myExpr ) {
  val <- f2( {{myExpr}} )
  val
}
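
Assuming rlang >= 0.4.0 is installed, this version behaves exactly like the original g2:

g2( a+b )    # 5
g2( b/a )    # 1.5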
Outsmart answered 8/11, 2018 at 20:41 Comment(1)
Great answer man, this was what I was looking for. Many thanks.Efthim

enquo() and !! also allow you to program with other dplyr verbs such as group_by() and select(). I'm not sure substitute and eval can do that. Take a look at this example, where I modify your data frame a little bit:

library(dplyr)

set.seed(1234)
d = data.frame(x = c(1, 1, 2, 2, 3),
               y = rnorm(5),
               z = runif(5))

# select, group_by & create a new output name based on input supplied
my_summarise <- function(df, group_var, select_var) {

  group_var <- enquo(group_var)
  select_var <- enquo(select_var)

  # create new name
  mean_name <- paste0("mean_", quo_name(select_var))

  df %>%
    select(!!select_var, !!group_var) %>% 
    group_by(!!group_var) %>%
    summarise(!!mean_name := mean(!!select_var))
}

my_summarise(d, x, z)

# A tibble: 3 x 2
      x mean_z
  <dbl>  <dbl>
1    1.  0.619
2    2.  0.603
3    3.  0.292

Edit: enquos() & !!! (or quos() outside a function) also make it easier to capture a list of variables

# example
grouping_vars <- quos(x, y)
d %>%
  group_by(!!!grouping_vars) %>%
  summarise(mean_z = mean(z))

# A tibble: 5 x 3
# Groups:   x [?]
      x      y mean_z
  <dbl>  <dbl>  <dbl>
1    1. -1.21   0.694
2    1.  0.277  0.545
3    2. -2.35   0.923
4    2.  1.08   0.283
5    3.  0.429  0.292


# in a function
my_summarise2 <- function(df, select_var, ...) {

  group_var <- enquos(...)
  select_var <- enquo(select_var)

  # create new name
  mean_name <- paste0("mean_", quo_name(select_var))

  df %>%
    select(!!select_var, !!!group_var) %>% 
    group_by(!!!group_var) %>%
    summarise(!!mean_name := mean(!!select_var))
}

my_summarise2(d, z, x, y)

# A tibble: 5 x 3
# Groups:   x [?]
      x      y mean_z
  <dbl>  <dbl>  <dbl>
1    1. -1.21   0.694
2    1.  0.277  0.545
3    2. -2.35   0.923
4    2.  1.08   0.283
5    3.  0.429  0.292

Credit: Programming with dplyr

Dawdle answered 6/4, 2018 at 23:55 Comment(3)
Thanks! It would be nice to see if substitute+eval could work in those cases too though. In the end, my question was basically: can I get away with programming using dplyr NSE functions with good ol' substitute+eval, or do I really need to learn to love all those rlang functions you mentioned because there is a benefit to it?Efthim
@mbiron: I'm curious to see a solution using substitute+eval too. IMO if you're using a lot of tidyverse packages then it's worth learning about tidyeval, as Hadley and other devs are pushing in that direction. Here is an example parsing input strings into dplyr. Another example using tidyeval in ggplot2Dawdle
@Efthim Of course you can theoretically use eval and substitute here. But the solutions would be painfully complex and complicated. {rlang}’s contribution is to generalise, formalise and simplify the solution by building on existing computer science research.Chief

Imagine there is a different x you want to multiply by:

> x <- 3
> f1(d, !!x)
  x            y two_y
1 1 -2.488894875     6
2 2 -1.133517746     6
3 3 -1.024834108     6
4 4  0.730537366     6
5 5 -1.325431756     6

vs without the !!:

> f1(d, x)
  x            y two_y
1 1 -2.488894875     2
2 2 -1.133517746     4
3 3 -1.024834108     6
4 4  0.730537366     8
5 5 -1.325431756    10

!! gives you more control over scoping than substitute: with substitute you can only get the second behaviour easily.
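
To get the first behaviour from a substitute-based function, you would have to force evaluation in the caller yourself. A rough sketch (f2_outer is a hypothetical variant of the question's f2, not something dplyr provides):

f2_outer <- function(mydata, myvar) {
  # evaluate the argument in the caller's frame instead of the data mask
  v <- eval.parent(substitute(myvar))
  mutate(mydata, two_y = 2 * v)
}

x <- 3
f2_outer(d, x)   # two_y is 6 in every row, like f1(d, !!x)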

Athematic answered 6/4, 2018 at 21:45 Comment(1)
I see. It seems related to something that shows up in this blog post: !! deals better with composition of functions that use NSE. Still, the examples seem a bit awkward.Efthim

To add some nuance, these things are not necessarily that complex in base R.

It is important to remember to use eval.parent() when relevant, so that substituted arguments are evaluated in the right environment; if you use eval.parent() properly, the expressions in nested calls will find their way. If you don't, you might discover environment hell :).

The base toolbox that I use is made of quote(), substitute(), bquote(), as.call(), and do.call() (the latter is useful combined with substitute(); see the sketch after the first example below).

Without going into details, here is how to solve in base R the cases presented by @Artem and @Tung without any tidy evaluation, and then how to solve the last example without quo()/enquo() but still benefiting from splicing and unquoting (!!! and !!).

We'll see that splicing and unquoting make the code nicer (but require functions to support them!), and that in the present cases using quosures doesn't improve things dramatically (though it arguably still helps).

solving Artem's case with base R

f0 <- function( myExpr ) {
  eval(substitute(myExpr), list(a=2, b=3))
}

g0 <- function( myExpr ) {
  val <- eval.parent(substitute(f0(myExpr)))
  val
}

f0(a+b)
#> [1] 5
g0(a+b)
#> [1] 5
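
As noted in the toolbox above, do.call() pairs well with substitute(): it splices an already-captured expression into the new call, so the substitute() inside f0 recovers it unchanged. A minimal sketch (h0 is a hypothetical wrapper, not from the original answer):

h0 <- function(myExpr) {
  do.call(f0, list(substitute(myExpr)))
}

h0(a+b)
#> [1] 5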

solving Tung's 1st case with base R

my_summarise0 <- function(df, group_var, select_var) {

  group_var  <- substitute(group_var)
  select_var <- substitute(select_var)

  # create new name
  mean_name <- paste0("mean_", as.character(select_var))

  eval.parent(substitute(
  df %>%
    select(select_var, group_var) %>% 
    group_by(group_var) %>%
    summarise(mean_name := mean(select_var))))
}

library(dplyr)
set.seed(1234)
d = data.frame(x = c(1, 1, 2, 2, 3),
               y = rnorm(5),
               z = runif(5))
my_summarise0(d, x, z)
#> # A tibble: 3 x 2
#>       x mean_z
#>   <dbl>  <dbl>
#> 1     1  0.619
#> 2     2  0.603
#> 3     3  0.292

solving Tung's 2nd case with base R

grouping_vars <- c(quote(x), quote(y))
eval(as.call(c(quote(group_by), quote(d), grouping_vars))) %>%
  summarise(mean_z = mean(z))
#> # A tibble: 5 x 3
#> # Groups:   x [3]
#>       x      y mean_z
#>   <dbl>  <dbl>  <dbl>
#> 1     1 -1.21   0.694
#> 2     1  0.277  0.545
#> 3     2 -2.35   0.923
#> 4     2  1.08   0.283
#> 5     3  0.429  0.292
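
The same grouped call can also be assembled with do.call() from the toolbox above; a sketch of what should be an equivalent formulation:

do.call(group_by, c(list(d), grouping_vars)) %>%
  summarise(mean_z = mean(z))
# same output as above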

in a function:

my_summarise02 <- function(df, select_var, ...) {

  group_var  <- eval(substitute(alist(...)))
  select_var <- substitute(select_var)

  # create new name
  mean_name <- paste0("mean_", as.character(select_var))

  df %>%
    # build select(., <select_var>, <group_vars...>) as a call, then evaluate it
    {eval(as.call(c(quote(select), quote(.), select_var, group_var)))} %>%
    # same trick for group_by(., <group_vars...>)
    {eval(as.call(c(quote(group_by), quote(.), group_var)))} %>%
    # bquote() splices the new name and the symbol into the summarise() call
    {eval(bquote(summarise(., .(mean_name) := mean(.(select_var)))))}
}

my_summarise02(d, z, x, y)
#> # A tibble: 5 x 3
#> # Groups:   x [3]
#>       x      y mean_z
#>   <dbl>  <dbl>  <dbl>
#> 1     1 -1.21   0.694
#> 2     1  0.277  0.545
#> 3     2 -2.35   0.923
#> 4     2  1.08   0.283
#> 5     3  0.429  0.292

solving Tung's 2nd case with base R but using !! and !!!

grouping_vars <- c(quote(x), quote(y))

d %>%
  group_by(!!!grouping_vars) %>%
  summarise(mean_z = mean(z))
#> # A tibble: 5 x 3
#> # Groups:   x [3]
#>       x      y mean_z
#>   <dbl>  <dbl>  <dbl>
#> 1     1 -1.21   0.694
#> 2     1  0.277  0.545
#> 3     2 -2.35   0.923
#> 4     2  1.08   0.283
#> 5     3  0.429  0.292

in a function:

my_summarise03 <- function(df, select_var, ...) {

  group_var  <- eval(substitute(alist(...)))
  select_var <- substitute(select_var)

  # create new name
  mean_name <- paste0("mean_", as.character(select_var))

  df %>%
    select(!!select_var, !!!group_var) %>% 
    group_by(!!!group_var) %>%
    summarise(!!mean_name := mean(!!select_var))
}

my_summarise03(d, z, x, y)
#> # A tibble: 5 x 3
#> # Groups:   x [3]
#>       x      y mean_z
#>   <dbl>  <dbl>  <dbl>
#> 1     1 -1.21   0.694
#> 2     1  0.277  0.545
#> 3     2 -2.35   0.923
#> 4     2  1.08   0.283
#> 5     3  0.429  0.292

Grizzled answered 4/10, 2019 at 15:50 Comment(2)
Of course we could also use the *_at() variants, but it's beside the point here.Grizzled
Very clever use of eval.parent()!Outsmart
