ddply + summarize for repeating same statistical function across large number of columns

Asked 28/5, 2012 at 16:19 Answered 25/4, 2014 at 14:14

Solved r multiple-columns plyr idioms split-apply-combine

Ok, second R question in quick succession.

My data:

           Timestamp    St_01  St_02 ...
1 2008-02-08 00:00:00  26.020 25.840 ...
2 2008-02-08 00:10:00  25.985 25.790 ...
3 2008-02-08 00:20:00  25.930 25.765 ...
4 2008-02-08 00:30:00  25.925 25.730 ...
5 2008-02-08 00:40:00  25.975 25.695 ...
...

Basically normally I would use a combination of ddply and summarize to calculate ensembles (e.g. mean for every hour across the whole year).

In the case above, I would create a category, e.g. hour (e.g. strptime(data$Timestamp,"%H") -> data$hour and then use that category in ddply, like ddply(data,"hour", summarize, St_01=mean(St_01), St_02=mean(St_02)...) to average by category across each of the columns.

but here is where it gets sticky. I have more than 40 columns to deal with and I'm not prepared to type them all one by one as parameters to the summarize function. I used to write a loop in shell to generate this code but that's not how programmers solve problems is it?

So pray tell, does anyone have a better way of achieving the same result but with less keystrokes?

Tooling answered 28/5, 2012 at 16:19 Comment(3)

Or reshape the St variables long then use your favorite aggregation functions by, aggregate, ddply to combine by c(hour,index), where index is the variable created in the reshape. – Stockman 28/5, 2012 at 16:30

easy points for ya :-) – Mog 24/4, 2015 at 2:34

lol thanks! hahaha. faith in humanity restored. – Tooling 24/4, 2015 at 2:34

You can use numcolwise() to run a summary over all numeric columns.

Here is an example using iris:

ddply(iris, .(Species), numcolwise(mean))
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Similarly, there is catcolwise() to summarise over all categorical columns.

See ?numcolwise for more help and examples.

EDIT

An alternative approach is to use reshape2 (proposed by @gsk3). This has more keystrokes in this example, but gives you enormous flexibility:

library(reshape2)

miris <- melt(iris, id.vars="Species")
x <- ddply(miris, .(Species, variable), summarize, mean=mean(value))

dcast(x, Species~variable, value.var="mean")
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Brigidbrigida answered 28/5, 2012 at 16:28 Comment(3)

one thing. how does it work with summarize? cos i need to summarize categorically within each column too. – Tooling 28/5, 2012 at 16:35

I'm not sure what you mean. Using colwise or family usually means you don't need to use summarize. Can you please expand on your question? – Brigidbrigida 28/5, 2012 at 16:36

@R-L it's either-or, two different approaches. Both have their merits. The melt one is neat if you're going to do a large amount of statistical calculations on each column, and want to peek at intermediate results. – Alfons 31/3, 2014 at 22:41

You can even simplify the second approach proposed by Andrie by omitting the ddply call completely. Just specify mean as the aggregation function in the dcast call:

library(reshape2)
miris <- melt(iris, id.vars="Species")
dcast(miris, Species ~ variable, mean)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

The same result can also be calculated very fast using the data.table package. The .SD variable in the j expression is a special data.table variable containing the subset of data for each group, excluding all columns used in by.

library(data.table)
dt_iris <- as.data.table(iris)
dt_iris[, lapply(.SD, mean), by = Species]

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1:     setosa        5.006       3.428        1.462       0.246
2: versicolor        5.936       2.770        4.260       1.326
3:  virginica        6.588       2.974        5.552       2.026

Yet another option would be the new version 0.2 of Hadley's dplyr package

library(dplyr)
group_by(iris, Species) %>% summarise_each(funs(mean))

Source: local data frame [3 x 5]

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Esperance answered 25/4, 2014 at 14:14 Comment(0)

Recommended topics

Hot tags