Programmatically calling group_by() on a varying variable

Using dplyr, I'd like to summarize [sic] by a variable that I can vary (e.g. in a loop or apply-style command).

Typing the names in directly works fine:

library(dplyr)
ChickWeight %>% group_by( Chick, Diet ) %>% summarise( mw = mean( weight ) )

But group_by wasn't written to take a character vector, so passing in column names stored as strings is harder.

v <- "Diet"
ChickWeight %>% group_by( c( "Chick", v ) ) %>% summarise( mw = mean( weight ) )
## Error

I'll post one solution, but I'm curious to see how others have solved this.

Antimagnetic answered 8/2, 2015 at 0:22 Comment(9)
Acceptant: :-) summarize [sic] +1
Piffle: Just do group_by_( c( "Chick", v ) ) instead of group_by( c( "Chick", v ) )....
Porett: @Ari If you use US spelling, why do you use summarise in code?
Piffle: And of course, if it wasn't possible with dplyr, you could also just do it easily with data.table :) as in library(data.table) ; as.data.table(ChickWeight)[, .(mw = mean(weight)), c("Chick", v)]
Antimagnetic: @KonradRudolph One more function call wrapped around things? In deference to Hadley's native ways? Out of habit from older Hadley packages? Dunno. :-)
Higinbotham: @KonradRudolph - I use summarise as well, mainly because there is no summarize_each. One less thing I have to remember.
Porett: @Richard The use of UK English in Hadley’s library is an unfortunate (= bad) decision. APIs should be uniform, not personalised. I favour British spelling in all my writing, yet I adhere to the uniform, established, US spelling in my code. It’s very annoying and breaks all kinds of principles of API design when other code breaks that rule (there’s a reason non-English programming languages are usually seen as a failed experiment). As such, I strongly recommend adhering to the US spelling (and the lack of summarize_each is probably an oversight).
Blameless: @KonradRudolph, there's an issue on github asking for a summarize_each alias.
Porett: @docendodiscimus There are actually at least two pull requests to fix it – I almost added a third this morning, before finding the other two.

The underscore functions of dplyr could be useful for that:

ChickWeight %>% group_by_( "Chick", v )  %>% summarise( mw = mean( weight ) )

From the new features in dplyr 0.3:

You can now program with dplyr – every function that uses non-standard evaluation (NSE) also has a standard evaluation (SE) twin that ends in _. For example, the SE version of filter() is called filter_(). The SE version of each function has similar arguments, but they must be explicitly “quoted”.

Muricate answered 8/2, 2015 at 0:45 Comment(1)
Councilor: It should be noted that while this answer was correct years ago, the use of the *_(.) verbs is deprecated, preferring !!sym(arg) or {{ arg }} or all_of(arg) (contextual). See dplyr.tidyverse.org/articles/programming.html.
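
For readers on a current dplyr, the replacements mentioned in the comment might look roughly like the sketch below. It assumes dplyr 1.0.0 or later (for across() and the .groups argument) and that rlang is installed, which it is whenever dplyr is. The result names by_across and by_sym are only illustrative.

```r
library(dplyr)

v <- "Diet"

# Option 1: select the grouping columns by name with all_of() inside across().
by_across <- ChickWeight %>%
  group_by(across(all_of(c("Chick", v)))) %>%
  summarise(mw = mean(weight), .groups = "drop")

# Option 2: turn the string into a symbol and unquote it with !!.
by_sym <- ChickWeight %>%
  group_by(Chick, !!rlang::sym(v)) %>%
  summarise(mw = mean(weight), .groups = "drop")
```

Both forms group by the Chick and Diet columns. The across(all_of()) form is probably the more convenient one when the whole set of grouping columns arrives as a character vector.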

Here's one solution and how I arrived at it.

What does group_by expect?

> group_by
function (x, ..., add = FALSE) 
{
    new_groups <- named_dots(...)

Down the rabbit hole:

> dplyr:::named_dots
function (...) 
{
    auto_name(dots(...))
}
<environment: namespace:dplyr>
> dplyr:::auto_name
function (x) 
{
    names(x) <- auto_names(x)
    x
}
<environment: namespace:dplyr>
> dplyr:::auto_names
function (x) 
{
    nms <- names2(x)
    missing <- nms == ""
    if (all(!missing)) 
        return(nms)
    deparse2 <- function(x) paste(deparse(x, 500L), collapse = "")
    defaults <- vapply(x[missing], deparse2, character(1), USE.NAMES = FALSE)
    nms[missing] <- defaults
    nms
}
<environment: namespace:dplyr>
> dplyr:::names2
function (x) 
{
    names(x) %||% rep("", length(x))
}

With that information in hand, how do we go about crafting a solution?

# Naive solution fails:
ChickWeight %>% do.call( group_by, list( Chick, Diet ) ) %>% summarise( mw = mean( weight ) )

# Slightly cleverer:
do.call( group_by, list( x = ChickWeight, Chick, Diet, add = FALSE ) ) %>% summarise( mw = mean( weight ) )
## But still fails with,
## Error in do.call(group_by, list(x = ChickWeight, Chick, Diet, add = FALSE)) : object 'Chick' not found

The solution lies in quoting the arguments so their evaluation is delayed until they're in the environment that includes the x tbl:

do.call( group_by, list( x = ChickWeight, quote(Chick), quote(Diet), add = FALSE ) ) %>% summarise( mw = mean( weight ) )
## Bingo!
v <- "Diet"
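# as.name() turns the string in v into a symbol; substitute( a, list( a = v ) ) would only insert the literal string "Diet"
do.call( group_by, list( x = ChickWeight, quote(Chick), as.name( v ), add = FALSE ) ) %>% summarise( mw = mean( weight ) )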
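
The last step hinges on whether the value held in v reaches group_by as a symbol or as a plain string. Here is a minimal base-R sketch of the difference, using only the built-in ChickWeight data; the names spliced, symbol, constant and column are illustrative, not part of dplyr.

```r
v <- "Diet"

# substitute() splices the value of v in unchanged, so the result is still a string.
spliced <- substitute(a, list(a = v))
# as.name() converts the string into a symbol, i.e. a reference to a column.
symbol <- as.name(v)

is.character(spliced)
is.name(symbol)

# Evaluating each against the data frame shows what a grouping call would see.
constant <- eval(spliced, ChickWeight)  # just the string itself
column <- eval(symbol, ChickWeight)     # the actual Diet column
```

A string constant evaluates to itself, so grouping on it would presumably produce a constant column rather than the Diet variable. The symbol looks the column up in the data.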

Antimagnetic answered 8/2, 2015 at 0:22 Comment(0)
