Compute mean and standard deviation by group for multiple variables in a data.frame
Asked Answered
D

7

32

Edit -- This question was originally titled << Long to wide data reshaping in R >>


I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:

ID  Obs 1   Obs 2   Obs 3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36

And what I want to end up with will look like this:

ID  Obs 1 mean  Obs 1 std dev   Obs 2 mean  Obs 2 std dev
1   x           x               x           x
2   x           x               x           x
3   x           x               x           x

And so forth. What I'm unsure of is whether I need additional information in my long-form data, or what. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way that seems to work to reshape the data correctly to start in on that process.

Thanks very much for any help.

Dunston answered 3/5, 2013 at 20:57 Comment(2)
Just a comment: I don't think that's what folks usually mean by moving from long to wide format.Stig
Plenty have commented, but I am surprised no one cared to fix such a misleading title (now done.)Beachcomber
L
38

This is an aggregation problem, not a reshaping problem as the question originally suggested -- we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In the base of R it can be done using aggregate like this (assuming DF is the input data frame):

ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

Note 1: A commenter pointed out that ag is a data frame for which some columns are matrices. Although initially that may seem strange, in fact it simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID and the ith column of the remainder ag[[i+1]] (or equivalanetly ag[-1][[i]]) is the matrix of statistics for the ith input observation column. If one wishes to access the jth statistic of the ith observation it is therefore ag[[i+1]][, j] which can also be written as ag[-1][[i]][, j] .

On the other hand, suppose there are k statistic columns for each observation in the input (where k=2 in the question). Then if we flatten the output then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]] or equivalently ag[-1][[k*(i-1)+j]] .

For example, compare the simplicity of the first expression vs. the second:

ag[-1][[2]]
##        mean      sd
## [1,] 36.333 10.2144
## [2,] 32.250  4.1932
## [3,] 43.500  4.9497

ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
##   Obs_2.mean Obs_2.sd
## 1     36.333  10.2144
## 2     32.250   4.1932
## 3     43.500   4.9497

Note 2: The input in reproducible form is:

Lines <- "ID  Obs_1   Obs_2   Obs_3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36"
DF <- read.table(text = Lines, header = TRUE)
Lightning answered 3/5, 2013 at 21:32 Comment(2)
Perhaps important to note: While the output of this will appear to be a data.frame with two columns for each column being aggregated (resulting in 7 columns with your example data), if you view the structure, you'll see that it is actually just four columns, with the aggregated columns being matrices. You can fix that with a do.call(data.frame, aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))).Blague
@Ananda Mahto, Good point. I have added some comemnts elaborating on this.Lightning
L
23

There are a few different ways to go about it. reshape2 is a helpful package. Personally, I like using data.table

Below is a step-by-step

If myDF is your data.frame:

library(data.table)
DT <- data.table(myDF)

DT

# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]

# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]

# If you would like to round the values: 
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]

# If we want to add names to the columns 
wide <- setnames(DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID], c("ID", sapply(names(DT)[-1], paste0, c(".men", ".SD"))))

wide

   ID Obs.1.men Obs.1.SD Obs.2.men Obs.2.SD Obs.3.men Obs.3.SD
1:  1    35.333    8.021    36.333   10.214      33.0    9.644
2:  2    29.750    3.594    32.250    4.193      30.5    5.916
3:  3    41.500    4.950    43.500    4.950      39.0    4.243

Also, this may or may not be helpful

> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
        Obs.1 Obs.2 Obs.3
Min.    25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median  33.00 32.00 36.00
Mean    34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max.    45.00 48.00 42.00
Lothario answered 3/5, 2013 at 21:8 Comment(2)
I tried this and got Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. Traceback showed that the problem was with the form of the call to sapply.Engedi
Is it possible to use the same method grouping by multiple factors? For example, by=c("ID", "factor2")?Antiscorbutic
C
18

Here is probably the simplest way to go about it (with a reproducible example):

library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
  Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))

   ID  Obs_1_mean Obs_1_std_dev  Obs_2_mean Obs_2_std_dev
1  1 -0.13994642     0.8258445 -0.15186380     0.4251405
2  2  1.49982393     0.2282299  0.50816036     0.5812907
3  3 -0.09269806     0.6115075 -0.01943867     1.3348792

EDIT: The following approach saves you a lot of typing when dealing with many columns.

ddply(df, .(ID), colwise(mean))

  ID      Obs_1      Obs_2      Obs_3
1  1 -0.3748831  0.1787371  1.0749142
2  2 -1.0363973  0.0157575 -0.8826969
3  3  1.0721708 -1.1339571 -0.5983944

ddply(df, .(ID), colwise(sd))

  ID     Obs_1     Obs_2     Obs_3
1  1 0.8732498 0.4853133 0.5945867
2  2 0.2978193 1.0451626 0.5235572
3  3 0.4796820 0.7563216 1.4404602
Carabineer answered 3/5, 2013 at 21:16 Comment(2)
There's one more observation you missed out. While this is the way to go with fewer columns, I think it gets ugly very quickly.Allogamy
can we calculate mean of rows using this method ?Tica
C
14

I add the dplyr solution.

set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))

#      ID Obs_1_mean Obs_2_mean Obs_3_mean  Obs_1_sd  Obs_2_sd  Obs_3_sd
#   (int)      (dbl)      (dbl)      (dbl)     (dbl)     (dbl)     (dbl)
# 1     1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
# 2     2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
# 3     3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
Cedar answered 14/4, 2016 at 9:13 Comment(0)
L
8

Here's another take on the data.table answers, using @Carson's data, that's a bit more readable (and also a little faster, because of using lapply instead of sapply):

library(data.table)
set.seed(1)
dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
#   ID mean.Obs_1 mean.Obs_2 mean.Obs_3  sd.Obs_1  sd.Obs_2  sd.Obs_3
#1:  1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
#2:  2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
#3:  3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
Lotti answered 3/5, 2013 at 21:31 Comment(5)
the second one should use sd and you use .SD twice.. is there a performance issue due to that? any idea?Allogamy
@Arun, thanks, fixed the sd bit. I don't know if there is a performance hit because of that, let me checkLotti
@Allogamy looks like there is an ~10% performance hit, but the good news is that it doesn't increase with more categoriesLotti
Also you'll see a optimisation message about creating names (mean, sd) for every by (which will be inefficient for huge data. I'm benchmarking on a 1e6 data.table. Will post the results shortly.Allogamy
This works for me, however the resulting columns all have the same name, i.e. Obs_1,Obs_2,Obs_3,Obs_1,Obs_2,Obs_3. not mean.Obs_1... any ideas why that is the case?Furculum
P
3

The updated dplyr solution, as for 2020

1: summarise_each_() is deprecated as of dplyr 0.7.0. and 2: funs() is deprecated as of dplyr 0.8.0.

ag.dplyr <- DF %>% group_by(ID) %>% summarise(across(.cols = everything(),list(mean = mean, sd = sd)))
Predation answered 20/10, 2020 at 10:36 Comment(0)
R
1

There is a helpful function in the psych package.

You should try the following implementation:

psych::describeBy(data$dependentvariable, group = data$groupingvariable)
Retouch answered 27/1, 2020 at 10:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.