Use ddply within a function and include variable of interest as an argument

I am relatively new to R, and trying to use ddply & summarise from the plyr package. This post almost, but not quite, answers my question. I could use some additional explanation/clarification.

My problem:

I want to create a simple function to summarize descriptive statistics, by group, for a given variable. Unlike the linked post, I would like to include the variable of interest as an argument to the function. As has already been discussed on this site, this works:

require(plyr)

ddply(mtcars, ~ cyl, summarise,
  mean = mean(hp),
  sd   = sd(hp),
  min  = min(hp),
  max  = max(hp)
)

But this doesn't:

descriptives_by_group <- function(dataset, group, x)
{
  ddply(dataset, ~ group, summarise,
    mean = mean(x),
    sd   = sd(x),
    min  = min(x),
    max  = max(x)
  )
}

descriptives_by_group(mtcars, cyl, hp)

Because of the volume of data with which I am working, I would like to be able to have a function that allows me to specify the variable of interest to me as well as the dataset and grouping variable.

I have tried to edit the various solutions found here to address my problem, but I don't understand the code well enough to do it successfully.

The original poster used the following example dataset:

a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")

With the desired output:

  b Ave
1 0 1.5
2 1 3.5

And the solution endorsed by Hadley was:

myFunction <- function(x, y){
NewColName <- "a"
z <- ddply(x, y, .fun = function(xx,col){
                         c(Ave = mean(xx[,col],na.rm=TRUE))}, 
           NewColName)
return(z)
}

Where myFunction(df, sv) returns the desired output.

I tried to break down the code piece-by-piece to see if, by getting a better understanding of the underlying mechanics, I could modify the code to include an argument to the function that would pass to what, in this example, is "NewColName" (the variable you want to get information about). But I am not having any success. My difficulty is that I do not understand what is happening with (xx[,col]). I know that mean(xx[,col]) should be taking the mean of the column with index col for the data frame xx. But I don't understand where the anonymous function is reading those values from.

Could someone please help me parse this? I've wasted hours on a trivial task I could accomplish easily with very repetitive code and/or with subsetting, but I got hung up on trying to make my script more simple and elegant, and on understanding the "whys" of this problem and its solution(s).

PS I have looked into the describeBy function from the psych package, but as far as I can tell, it does not let you specify the variable(s) you want to return values for, and consequently does not solve my problem.

Southerly answered 29/8, 2013 at 16:40 Comment(3)
I'm not sure I understand. ddply accepts a character vector of grouping variables. As in, ddply(data, c('var1','var2'), ...). – Patsy
Also take a look at colwise. – Disillusionize
The problem was getting the third argument to be passed into summarise. – Ewen
S
8

I just moved a couple things around in the example function you gave and showed how to get more than one column back out. Does this do what you want?

myFunction2 <- function(x, y, col){
z <- ddply(x, y, .fun = function(xx){
                         c(mean = mean(xx[,col],na.rm=TRUE),
                         max = max(xx[,col],na.rm=TRUE) ) })
return(z)
}

myFunction2(mtcars, "cyl", "hp")
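
On the question of where the anonymous function reads xx from: ddply splits the data frame by the grouping variable and calls the function once per piece, so xx is one group's rows each time. The sketch below imitates those split/apply/combine steps in base R. It is only an illustration of the idea, not plyr's actual implementation, and the names pieces, summarise_piece and combined are made up for this example.

```r
# Step 1 (split): one data frame per distinct value of the grouping column.
pieces <- split(mtcars, mtcars[["cyl"]])

# Step 2 (apply): the function receives one piece at a time as `xx`;
# `col` is an extra argument naming the column to summarise.
summarise_piece <- function(xx, col) {
  c(mean = mean(xx[, col], na.rm = TRUE),
    max  = max(xx[, col], na.rm = TRUE))
}
results <- lapply(pieces, summarise_piece, col = "hp")

# Step 3 (combine): stack the per-group results into one matrix,
# with one row per group (row names are the cyl values).
combined <- do.call(rbind, results)
combined
```

In the original myFunction, col was filled in from the trailing NewColName argument that ddply forwards to .fun; in myFunction2 it is simply found in the enclosing function's arguments.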
Standpoint answered 29/8, 2013 at 17:59 Comment(0)
E
5

(More of a comment than an answer. I had the same level of difficulty as you when using ddply(...,summarise, ...) inside a function.) This is a base solution that worked the way I expected:

descriptives_by_group <- function(dataset, group, x) {
  # group and x are column names given as strings
  aggregate(dataset[[x]], dataset[group], function(v)
    c(mean = mean(v),
      sd   = sd(v),
      min  = min(v),
      max  = max(v)))
}

descriptives_by_group(mtcars, 'cyl', 'hp')
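
One thing to watch for with this approach: because the summary function returns four values, aggregate stores them in a single matrix column (named x) rather than in four separate columns. Below is a minimal sketch of flattening that result with do.call(data.frame, ...). The names res and flat are just for illustration.

```r
descriptives_by_group <- function(dataset, group, x) {
  # group and x are column names given as strings
  aggregate(dataset[[x]], dataset[group], function(v)
    c(mean = mean(v), sd = sd(v), min = min(v), max = max(v)))
}

res <- descriptives_by_group(mtcars, "cyl", "hp")
ncol(res)   # 2: the grouping column plus one matrix column called x

# Expand the matrix column into ordinary data frame columns.
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, each statistic can be accessed as a regular column, which is usually easier to work with downstream.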
Ewen answered 29/8, 2013 at 18:57 Comment(0)
S
3

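Just use the as.quoted function, which accepts the grouping variable name(s) as a character vector. Example below:

simple_ddply <- function(dataset_name, variable_name, ...) {
    # `...` stands for the remaining ddply input (e.g. the function to apply)
    ddply(dataset_name, as.quoted(variable_name), ...)
}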
Sargeant answered 17/6, 2014 at 11:58 Comment(0)
P
1

With the introduction of quosures in the devel version of dplyr (soon to be released as 0.6.0), this becomes a bit easier:

library(dplyr)
descriptives_by_groupN <- function(dataset, group, x) {

   group <- enquo(group)
   x <- enquo(x)

  dataset %>%
         group_by(!!group) %>%
         summarise(Mean = mean(!!x),
                SD = sd(!!x),
                Min = min(!!x),
                Max = max(!!x))
}

descriptives_by_groupN(mtcars, cyl, hp)
# A tibble: 3 × 5
#   cyl      Mean       SD   Min   Max
#  <dbl>     <dbl>    <dbl> <dbl> <dbl>
#1     4  82.63636 20.93453    52   113
#2     6 122.28571 24.26049   105   175
#3     8 209.21429 50.97689   150   335

Here, the input arguments are converted to quosures with enquo. Inside group_by/summarise, the quosures are then unquoted (with !! or UQ) so that they are evaluated in the context of the data.
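
For readers on a more recent dplyr, the same idea can probably be written more compactly with the embrace operator {{ }}, which combines the enquo and !! steps. This is a sketch that assumes a dplyr built on rlang 0.4.0 or later; the function name descriptives_by_group_embrace is made up for illustration.

```r
library(dplyr)

descriptives_by_group_embrace <- function(dataset, group, x) {
  dataset %>%
    # {{ group }} quotes the caller's argument and unquotes it in one step
    group_by({{ group }}) %>%
    summarise(Mean = mean({{ x }}),
              SD   = sd({{ x }}),
              Min  = min({{ x }}),
              Max  = max({{ x }}))
}

result <- descriptives_by_group_embrace(mtcars, cyl, hp)
result
```

As with the enquo version, the column names are passed unquoted (cyl, hp), not as strings.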

Pantaloon answered 15/4, 2017 at 4:23 Comment(0)
