How to create summaries of subgroups based on factors in R

B

3

1

I want to calculate the mean for each numeric variable in the following example. These need to be grouped by each factor associated with "id" and by each factor associated with"status".

set.seed(10)
dfex <- 
data.frame(id=c("2","1","1","1","3","2","3"),status=c("hit","miss","miss","hit","miss","miss","miss"),var3=rnorm(7),var4=rnorm(7),var5=rnorm(7),var6=rnorm(7))

For the means of "id" groups, the first row of output would be labeled "mean-id-1". Rows labeled "mean-id-2" and "mean-id-3" would follow. For the means of "status" groups, the rows would be labeled "mean-status-miss" and "mean-status-hit". My objective is to generate these means and their row labels programatically.

I've tried many different permutations of apply functions, but each has issues. I've also experimented with the aggregate function.

Brathwaite answered 13/6, 2014 at 5:57 Comment(0)

T

0

With base R the following works for the "id" column:

means_id <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$id),mean)
rownames(means_id) <- paste0("mean-id-",means_id$Group.1)
means_id$Group.1 <- NULL

Output:

                var3       var4       var5       var6
mean-id-1 -0.7182503 -0.2604572 -0.3535823 -1.3530417
mean-id-2  0.2042702 -0.3009548  0.6121843 -1.4364211
mean-id-3 -0.4567655  0.8716131  0.1646053 -0.6229102

The same for the "status" column:

means_status <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$status),mean)
rownames(means_status) <- paste0("mean-status-",means_status$Group.1)
means_status$Group.1 <- NULL

Thies answered 13/6, 2014 at 6:14 Comment(1)

Insa,your "aggregate" solution seems to be the one that most easily allows me to replace mean with sd, so I like the flexibility. I removed 'grep("var",' from the first line of your code because in my full dataset, the column names don't follow a pattern. In dfex, this results in two columns - "id" and "status" - being populated with NAs, but it's easy enough to remove the columns. Thank you! – Brathwaite 14/6, 2014 at 0:1

S

1

Probably the fastest way to do this will be with data.table (for big data sets), although I didn't find a way to present new row names in data.table object, thus I converted it back to data.frame

library(data.table)
setDT(dfex) # convert `dfex` to a `data.table` object
#setkey(dfex, id) # This is not necessary, only if you want to sort your table by "id" column first
dat1 <- as.data.frame(dfex[,-2, with = F][, lapply(.SD, mean), by = id])
rownames(dat1) <- paste0("mean-id-", as.character(dat1[,"id"]))
dat2 <- as.data.frame(dfex[,-1, with = F][, lapply(.SD, mean), by = status])
rownames(dat2) <- paste0("mean-status-", as.character(dat2[,"status"]))

Sinuous answered 13/6, 2014 at 6:43 Comment(2)

Thanks David. I'm having trouble understanding how exactly this works. Can you comment on '[,-2, with = F]' and on 'lapply(.SD, mean)'? – Brathwaite 13/6, 2014 at 19:3

[,-2, with = F] is just data.table way to reference a column, i.e., it means "remove second column". lapply(.SD, mean) means "apply the function mean on all the columns". You can manually specify the columns you want to apply the function on using .SDcols. See ?data.table for further explanations. I would recommend using this package in the future for aggregations if you have big data sets – Sinuous 14/6, 2014 at 18:21

T

0

With base R the following works for the "id" column:

means_id <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$id),mean)
rownames(means_id) <- paste0("mean-id-",means_id$Group.1)
means_id$Group.1 <- NULL

Output:

                var3       var4       var5       var6
mean-id-1 -0.7182503 -0.2604572 -0.3535823 -1.3530417
mean-id-2  0.2042702 -0.3009548  0.6121843 -1.4364211
mean-id-3 -0.4567655  0.8716131  0.1646053 -0.6229102

The same for the "status" column:

means_status <- aggregate(dfex[,grep("var",names(dfex))],list(dfex$status),mean)
rownames(means_status) <- paste0("mean-status-",means_status$Group.1)
means_status$Group.1 <- NULL

Thies answered 13/6, 2014 at 6:14 Comment(1)

Insa,your "aggregate" solution seems to be the one that most easily allows me to replace mean with sd, so I like the flexibility. I removed 'grep("var",' from the first line of your code because in my full dataset, the column names don't follow a pattern. In dfex, this results in two columns - "id" and "status" - being populated with NAs, but it's easy enough to remove the columns. Thank you! – Brathwaite 14/6, 2014 at 0:1

A

0

You could do:

do.call(rbind,by(dfex[,-(1:2)], paste("mean-id",dfex[,1],sep="-"), colMeans)) 
              var3       var4       var5       var6
mean-id-1 -0.7383944  0.5005763 -0.4777325  0.6988741
mean-id-2 -0.0316267 -0.1764453  0.1313834  0.6867287
mean-id-3  0.7489377  0.8091953  0.9290247 -0.1263163

Create both result as a list:

 lapply(c("id","status"), function(x) do.call(rbind,by(dfex[grep("var",names(dfex))], paste("mean-id",dfex[,x],sep="-"), colMeans)))

Update:

library(matrixStats)
 lapply(c("id","status"), function(x) do.call(rbind,by(dfex[grep("var",names(dfex))], paste("mean-id",dfex[,x],sep="-"), colSds)))
 [[1]]
              var3       var4      var5      var6
 mean-id-1 0.6024318 1.36423044 0.5398717 0.7260939
 mean-id-2 0.2623706 0.08870122 0.1827246 1.0590560
 mean-id-3 1.0625137 0.16381062 1.0760977 0.3524908

[[2]]
                  var3     var4      var5      var6
mean-id-hit  0.4369311 1.036234 0.6622341 0.6506010
mean-id-miss 0.8288436 1.035163 0.7688912 0.6799636

Arouse answered 13/6, 2014 at 7:43 Comment(2)

Nice solution, thanks. Not sure if this belongs in a different post, but is there a way to clone your code to calc sd? I saw in another post that the matrix.Stats package had a function called colSds, but when I ran it I got the following error msg: "Error in UseMethod("colCounts") : no applicable method for 'colCounts' applied to an object of class "data.frame" – Brathwaite 13/6, 2014 at 21:30

Hi 'user3614783'. I couldn't reproduce your error using the same dataset example. Please check the edited version. – Arouse 14/6, 2014 at 2:7

Recommended topics

Hot tags