How to create summaries of subgroups based on factors in R
Asked Answered
B

3

1

I want to calculate the mean for each numeric variable in the following example. These need to be grouped by each factor associated with "id" and by each factor associated with"status".

set.seed(10)
dfex <- 
data.frame(id=c("2","1","1","1","3","2","3"),status=c("hit","miss","miss","hit","miss","miss","miss"),var3=rnorm(7),var4=rnorm(7),var5=rnorm(7),var6=rnorm(7))

For the means of "id" groups, the first row of output would be labeled "mean-id-1". Rows labeled "mean-id-2" and "mean-id-3" would follow. For the means of "status" groups, the rows would be labeled "mean-status-miss" and "mean-status-hit". My objective is to generate these means and their row labels programatically.

I've tried many different permutations of apply functions, but each has issues. I've also experimented with the aggregate function.

Brathwaite answered 13/6, 2014 at 5:57 Comment(0)
T
0

With base R the following works for the "id" column:

means_id <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$id),mean)
rownames(means_id) <- paste0("mean-id-",means_id$Group.1)
means_id$Group.1 <- NULL

Output:

                var3       var4       var5       var6
mean-id-1 -0.7182503 -0.2604572 -0.3535823 -1.3530417
mean-id-2  0.2042702 -0.3009548  0.6121843 -1.4364211
mean-id-3 -0.4567655  0.8716131  0.1646053 -0.6229102

The same for the "status" column:

means_status <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$status),mean)
rownames(means_status) <- paste0("mean-status-",means_status$Group.1)
means_status$Group.1 <- NULL
Thies answered 13/6, 2014 at 6:14 Comment(1)
Insa,your "aggregate" solution seems to be the one that most easily allows me to replace mean with sd, so I like the flexibility. I removed 'grep("var",' from the first line of your code because in my full dataset, the column names don't follow a pattern. In dfex, this results in two columns - "id" and "status" - being populated with NAs, but it's easy enough to remove the columns. Thank you!Brathwaite
S
1

Probably the fastest way to do this will be with data.table (for big data sets), although I didn't find a way to present new row names in data.table object, thus I converted it back to data.frame

library(data.table)
setDT(dfex) # convert `dfex` to a `data.table` object
#setkey(dfex, id) # This is not necessary, only if you want to sort your table by "id" column first
dat1 <- as.data.frame(dfex[,-2, with = F][, lapply(.SD, mean), by = id])
rownames(dat1) <- paste0("mean-id-", as.character(dat1[,"id"]))
dat2 <- as.data.frame(dfex[,-1, with = F][, lapply(.SD, mean), by = status])
rownames(dat2) <- paste0("mean-status-", as.character(dat2[,"status"]))
Sinuous answered 13/6, 2014 at 6:43 Comment(2)
Thanks David. I'm having trouble understanding how exactly this works. Can you comment on '[,-2, with = F]' and on 'lapply(.SD, mean)'?Brathwaite
[,-2, with = F] is just data.table way to reference a column, i.e., it means "remove second column". lapply(.SD, mean) means "apply the function mean on all the columns". You can manually specify the columns you want to apply the function on using .SDcols. See ?data.table for further explanations. I would recommend using this package in the future for aggregations if you have big data setsSinuous
T
0

With base R the following works for the "id" column:

means_id <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$id),mean)
rownames(means_id) <- paste0("mean-id-",means_id$Group.1)
means_id$Group.1 <- NULL

Output:

                var3       var4       var5       var6
mean-id-1 -0.7182503 -0.2604572 -0.3535823 -1.3530417
mean-id-2  0.2042702 -0.3009548  0.6121843 -1.4364211
mean-id-3 -0.4567655  0.8716131  0.1646053 -0.6229102

The same for the "status" column:

means_status <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$status),mean)
rownames(means_status) <- paste0("mean-status-",means_status$Group.1)
means_status$Group.1 <- NULL
Thies answered 13/6, 2014 at 6:14 Comment(1)
Insa,your "aggregate" solution seems to be the one that most easily allows me to replace mean with sd, so I like the flexibility. I removed 'grep("var",' from the first line of your code because in my full dataset, the column names don't follow a pattern. In dfex, this results in two columns - "id" and "status" - being populated with NAs, but it's easy enough to remove the columns. Thank you!Brathwaite
A
0

You could do:

do.call(rbind,by(dfex[,-(1:2)], paste("mean-id",dfex[,1],sep="-"), colMeans)) 
              var3       var4       var5       var6
mean-id-1 -0.7383944  0.5005763 -0.4777325  0.6988741
mean-id-2 -0.0316267 -0.1764453  0.1313834  0.6867287
mean-id-3  0.7489377  0.8091953  0.9290247 -0.1263163

Create both result as a list:

 lapply(c("id","status"), function(x) do.call(rbind,by(dfex[grep("var",names(dfex))], paste("mean-id",dfex[,x],sep="-"), colMeans)))

Update:

library(matrixStats)
 lapply(c("id","status"), function(x) do.call(rbind,by(dfex[grep("var",names(dfex))], paste("mean-id",dfex[,x],sep="-"), colSds)))
 [[1]]
              var3       var4      var5      var6
 mean-id-1 0.6024318 1.36423044 0.5398717 0.7260939
 mean-id-2 0.2623706 0.08870122 0.1827246 1.0590560
 mean-id-3 1.0625137 0.16381062 1.0760977 0.3524908

[[2]]
                  var3     var4      var5      var6
mean-id-hit  0.4369311 1.036234 0.6622341 0.6506010
mean-id-miss 0.8288436 1.035163 0.7688912 0.6799636
Arouse answered 13/6, 2014 at 7:43 Comment(2)
Nice solution, thanks. Not sure if this belongs in a different post, but is there a way to clone your code to calc sd? I saw in another post that the matrix.Stats package had a function called colSds, but when I ran it I got the following error msg: "Error in UseMethod("colCounts") : no applicable method for 'colCounts' applied to an object of class "data.frame"Brathwaite
Hi 'user3614783'. I couldn't reproduce your error using the same dataset example. Please check the edited version.Arouse

© 2022 - 2024 — McMap. All rights reserved.