Taking column mean over a list of data frames in R
Asked Answered
S

4

7

Here's what I'm trying to do. My data frame has a factor variable, "country", and I want to split the data frame based on country. Then, I want to take the column mean over every variable for every country's data frame.

Data here: https://github.com/pourque/country-data

I've done this so far...

myList <- split(df1, df1$country)
for(i in 1:length(myList)) {
aggregate <- mapply(myList[[i]][,-c(38:39)], colMeans)
}

(I'm not including the 38th and 39th columns because those are factors.)

I've read this (function over more than one list) , which makes me think mapply is the answer here...but I'm getting this error:

Error in match.fun(FUN) : 
'myList[[i]][, -c(38:39)]' is not a function, character or symbol 

Maybe I'm formatting it incorrectly?

Shulins answered 4/5, 2015 at 16:55 Comment(2)
Why bother splitting in to separate data frames? Just use data.table or dplyr.Pervert
This is turning into a good post now !Cowper
W
4

A data.table answer:

library(data.table)

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c('age', 'gender')]

Now tidier syntax with deselection in .SDcols, thanks to user Arun

To explain what's happening here:

  • setDT(df1) make the data.frame a data.table
  • lapply(.SD, mean) for each column in the subset of data, take the mean
  • by = county do this by groups split according to country
  • .SDcols = -c('age', 'gender') omit age and gender columns from the subset of data
Weinberger answered 4/5, 2015 at 17:10 Comment(6)
I kind of don't like the readability of this vs some of the other answers. Any other data.table people want to suggest an improvement?Weinberger
Instead of the setdiff thing, couldn't you just use .SDcols?Metrify
Could specify all the desired cols to .SDcols, but that seems about the same as the above. Is there a tidy way to use .SDcols to omit specified columns?Weinberger
setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c("age", "gender")]Inspire
It looks like the latest version has nicer syntax for .SDcols, yeah, in items 14-17 here: github.com/Rdatatable/… Even with the CRAN version of data.table you could do df1[ , lapply(.SD, mean), by=country,.SDcols=setdiff(names(df1),c('age','gender'))], though, right?Metrify
Aha, thanks @Arun! Didn't know you could do deselection with - in .SDcols. @Frank, yes def, but that seemed similar is all i meant (in that it still requires setdiff)Weinberger
C
7

It's straightforward in base R using aggregate without the need to split the data.frame into a list beforehand. Here's an example using the built-in iris data where you compute the mean of all variables except those in the first and second column by group of Species:

data(iris)
aggregate(. ~ Species, iris[-(1:2)], mean)
#     Species Petal.Length Petal.Width
#1     setosa        1.462       0.246
#2 versicolor        4.260       1.326
#3  virginica        5.552       2.026

The . inside aggregate is used to specify that you want to use all remaining columns of the data.frame except the grouping variable (Species in this case). And because you specify iris[-(1:2)] as input data, the first and second columns are not used either.


For your data, it should then be something like:

aggregate(. ~ country, df1[-c(38:39)], mean)
Cariotta answered 4/5, 2015 at 17:19 Comment(8)
This is a great answer. Thank you for the thorough explanation! When I run it, though, it still averages gender (col. 39) instead of omitting it. (Not a big deal for the data, but is there a way to prevent this?)Shulins
@Anshu, can you try again please with the slightly modified code at the bottom?Cariotta
Or try using df1[-c("age", "gender")] inside aggregateCariotta
It's especially weird that I'm getting averaged gender values of >3 since the only gender choices are supposed to be 1 or 2. E.g., df1[df1$country=="AF",] gives only one observation whose gender is 2, but the aggregate gives gender=3!Shulins
The modified version doesn't seem to be working, and attempting df1[-c("age", "gender")] gives the warning "invalid argument to unary operator" :(Shulins
Thats strange. I haven't downloaded your data yet. Might do that when I'm back on my computerCariotta
@Anshu, I downloaded your data, read it into R and then ran x <- aggregate(. ~ country, df1[-c(38:39)], mean). It produces the correct output and doesn't include the columns "gender" or "age". There must have been a typo in the code you tried or perhaps a different data structure than in the example on github.Cariotta
Ah, thanks! I got it to work in the end...thanks for all your help :DShulins
P
6
library(dplyr)

df1 %>%
    group_by(country) %>%
    select(-age, -gender) %>%
    summarise_each(funs(mean))
Pervert answered 4/5, 2015 at 17:2 Comment(1)
You could also unselect the two columns inside the summarise_eachCariotta
W
4

A data.table answer:

library(data.table)

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c('age', 'gender')]

Now tidier syntax with deselection in .SDcols, thanks to user Arun

To explain what's happening here:

  • setDT(df1) make the data.frame a data.table
  • lapply(.SD, mean) for each column in the subset of data, take the mean
  • by = county do this by groups split according to country
  • .SDcols = -c('age', 'gender') omit age and gender columns from the subset of data
Weinberger answered 4/5, 2015 at 17:10 Comment(6)
I kind of don't like the readability of this vs some of the other answers. Any other data.table people want to suggest an improvement?Weinberger
Instead of the setdiff thing, couldn't you just use .SDcols?Metrify
Could specify all the desired cols to .SDcols, but that seems about the same as the above. Is there a tidy way to use .SDcols to omit specified columns?Weinberger
setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c("age", "gender")]Inspire
It looks like the latest version has nicer syntax for .SDcols, yeah, in items 14-17 here: github.com/Rdatatable/… Even with the CRAN version of data.table you could do df1[ , lapply(.SD, mean), by=country,.SDcols=setdiff(names(df1),c('age','gender'))], though, right?Metrify
Aha, thanks @Arun! Didn't know you could do deselection with - in .SDcols. @Frank, yes def, but that seemed similar is all i meant (in that it still requires setdiff)Weinberger
C
4

If you insist on keeping all in list:

#split and make list of df
myList <- split(df, df$country)

#aggregate without age and gender
my_aggregate <- function(df_inlist) {
  df_inlist <- aggregate(.~country, df_inlist[ , -c(38, 39)], mean)
}

#Apply aggregate function on all data frames in the list
out <- lapply(myList, function (x) {
  my_aggregate(x)
})

out is a list of data.frames for each country and colmeans over variables. How put it all together in a data.frame :

composite_df <- do.call(rbind, out)
Cowper answered 4/5, 2015 at 17:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.