Summary statistics using ddply

Asked 19/4, 2011 at 10:0 Answered 23/5 at 22:28

I like to write a function using ddply that outputs the summary statistics based on the name of two columns of data.frame mat.

mat is a big data.frame with the name of columns "metric", "length", "species", "tree", ...,"index"
index is factor with 2 levels "Short", "Long"
"metric", "length", "species", "tree" and others are all continuous variables

Function:

summary1 <- function(arg1,arg2) {
    ...

    ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as.list(summary(X$arg2)),
        .parallel = FALSE)

    ss
}

I expect the output to look like this after calling summary1("metric","length")

Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length
.Median length.Mean length.3rd.Qu. length.Max. 

....

Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length
.Median length.Mean length.3rd.Qu. length.Max.

....

At the moment the function does not produce the desired output? What modification should be made here?

Thanks for your help.

Here is a toy example

mat <- data.frame(
    metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
    tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)

Nerine answered 19/4, 2011 at 10:0 Comment(5)

This would be easier to answer if you provided sample data (prefereably with dput). – Priester 19/4, 2011 at 10:11

@Richie- Here is a toy example

mat<-data.frame(metric=rpois(10,10),length=rpois(10,10),species=rpois(10,10),tree=rpois(10,10),index=c(rep("Short",5),rep("Long",5)))

- Thanks – Nerine 19/4, 2011 at 10:23

You could edit question to add sample data instead of writing a comment (I done it for you ;)). – Montespan 19/4, 2011 at 11:19

I would suggest making your function more generalizable by passing in additional parameters for the data.frame and the variable to split by as well. That way your function will work when you need to use it on a data.frame named Mat or MAT or MyOtherData, etc. – Malraux 19/4, 2011 at 13:58

There should be an R generic function for that. Even supporting any number of arguments. Is there such? – Phrygian 28/11, 2012 at 14:39

As Nick wrote in his answer you can't use $ to reference variable passed as character name. When you wrote X$arg1 then R search for column named "arg1" in data.frame X. You can reference to it either by X[,arg1] or X[[arg1]].

And if you want nicely named output I propose below solution:

summary1 <- function(arg1, arg2) {

    ss <- ddply(mat, .(index), function(X) data.frame(
        setNames(
            list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
            c(arg1,arg2)
            )), .parallel = FALSE)

    ss
}
summary1("metric","length")

Output for toy data is:

  index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
1  Long           5              7            10         8.6             10
2 Short           7              7             9         8.8             10
  metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
1          11           9             10            11        10.8             12
2          11           4              9             9         9.0             11
  length.Max.
1          12
2          12

Montespan answered 19/4, 2011 at 11:51 Comment(0)

Is this more like what you want?

summary1 <- function(arg1,arg2) {
ss <- ddply(mat, .(index), function(X){ data.frame(
    arg1 = as.list(summary(X[,arg1])),
    arg2 = as.list(summary(X[,arg2])),
    .parallel = FALSE)})
ss
}

Dropwort answered 19/4, 2011 at 11:2 Comment(0)

As ddply is long outdated now, skimr is a quick way to get grouped summary statistics:

> my_skim <- skim_with(numeric = sfl(median))
> mat %>% group_by(index) %>% my_skim
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             10        
Number of columns          5         
_______________________              
Column type frequency:               
  numeric                  4         
________________________             
Group variables            index     

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
  skim_variable index n_missing complete_rate mean   sd p0 p25 p50 p75 p100 hist  median
1 metric        Long          0             1 10.2 3.70  5   8  11  13   14 ▃▃▁▃▇     11
2 metric        Short         0             1 10.6 3.21  6  10  11  11   15 ▂▁▇▁▂     11
3 length        Long          0             1  9.8 2.05  8   8  10  10   13 ▇▇▁▁▃     10
4 length        Short         0             1  8.6 1.34  7   8   8  10   10 ▃▇▁▁▇      8
5 species       Long          0             1  8.8 4.09  4   7   8  10   15 ▃▇▃▁▃      8
6 species       Short         0             1 11.4 3.36  7   9  12  14   15 ▃▃▁▃▇     12
7 tree          Long          0             1  8.8 3.83  6   6   7  10   15 ▇▁▂▁▂      7
8 tree          Short         0             1  9   2.55  6   8   9   9   13 ▃▃▇▁▃      9

The summary statistics shown, like median, can be customized with sfl() passed into the skim_with factory.

The resulting summary is in tall form based on grouping variable index. This is better to work with than many summary columns in a wide format. You can also get the summary dataframe instead of the printed text summary.

Ruminate answered 23/5 at 22:28 Comment(0)

Recommended topics

Hot tags