Taking column mean over a list of data frames in R

Asked 4/5, 2015 at 16:55 Answered 4/5, 2015 at 17:19

Solved r list dataframe aggregate group-summaries

Here's what I'm trying to do. My data frame has a factor variable, "country", and I want to split the data frame based on country. Then, I want to take the column mean over every variable for every country's data frame.

Data here: https://github.com/pourque/country-data

I've done this so far...

myList <- split(df1, df1$country)
for(i in 1:length(myList)) {
aggregate <- mapply(myList[[i]][,-c(38:39)], colMeans)
}

(I'm not including the 38th and 39th columns because those are factors.)

I've read this (function over more than one list) , which makes me think mapply is the answer here...but I'm getting this error:

Error in match.fun(FUN) : 
'myList[[i]][, -c(38:39)]' is not a function, character or symbol

Maybe I'm formatting it incorrectly?

Shulins answered 4/5, 2015 at 16:55 Comment(2)

Why bother splitting in to separate data frames? Just use data.table or dplyr. – Pervert 4/5, 2015 at 16:58

This is turning into a good post now ! – Cowper 4/5, 2015 at 17:27

A data.table answer:

library(data.table)

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c('age', 'gender')]

Now tidier syntax with deselection in .SDcols, thanks to user Arun

To explain what's happening here:

setDT(df1) make the data.frame a data.table
lapply(.SD, mean) for each column in the subset of data, take the mean
by = county do this by groups split according to country
.SDcols = -c('age', 'gender') omit age and gender columns from the subset of data

Weinberger answered 4/5, 2015 at 17:10 Comment(6)

I kind of don't like the readability of this vs some of the other answers. Any other data.table people want to suggest an improvement? – Weinberger 4/5, 2015 at 17:20

Instead of the setdiff thing, couldn't you just use .SDcols? – Metrify 4/5, 2015 at 17:23

Could specify all the desired cols to .SDcols, but that seems about the same as the above. Is there a tidy way to use .SDcols to omit specified columns? – Weinberger 4/5, 2015 at 17:31

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c("age", "gender")] – Inspire 4/5, 2015 at 17:34

It looks like the latest version has nicer syntax for .SDcols, yeah, in items 14-17 here: github.com/Rdatatable/… Even with the CRAN version of data.table you could do df1[ , lapply(.SD, mean), by=country,.SDcols=setdiff(names(df1),c('age','gender'))], though, right? – Metrify 4/5, 2015 at 17:35

Aha, thanks @Arun! Didn't know you could do deselection with - in .SDcols. @Frank, yes def, but that seemed similar is all i meant (in that it still requires setdiff) – Weinberger 4/5, 2015 at 17:54

It's straightforward in base R using aggregate without the need to split the data.frame into a list beforehand. Here's an example using the built-in iris data where you compute the mean of all variables except those in the first and second column by group of Species:

data(iris)
aggregate(. ~ Species, iris[-(1:2)], mean)
#     Species Petal.Length Petal.Width
#1     setosa        1.462       0.246
#2 versicolor        4.260       1.326
#3  virginica        5.552       2.026

The . inside aggregate is used to specify that you want to use all remaining columns of the data.frame except the grouping variable (Species in this case). And because you specify iris[-(1:2)] as input data, the first and second columns are not used either.

For your data, it should then be something like:

aggregate(. ~ country, df1[-c(38:39)], mean)

Cariotta answered 4/5, 2015 at 17:19 Comment(8)

This is a great answer. Thank you for the thorough explanation! When I run it, though, it still averages gender (col. 39) instead of omitting it. (Not a big deal for the data, but is there a way to prevent this?) – Shulins 4/5, 2015 at 17:49

@Anshu, can you try again please with the slightly modified code at the bottom? – Cariotta 4/5, 2015 at 17:52

Or try using df1[-c("age", "gender")] inside aggregate – Cariotta 4/5, 2015 at 17:53

It's especially weird that I'm getting averaged gender values of >3 since the only gender choices are supposed to be 1 or 2. E.g., df1[df1$country=="AF",] gives only one observation whose gender is 2, but the aggregate gives gender=3! – Shulins 4/5, 2015 at 17:55

The modified version doesn't seem to be working, and attempting df1[-c("age", "gender")] gives the warning "invalid argument to unary operator" :( – Shulins 4/5, 2015 at 17:59

Thats strange. I haven't downloaded your data yet. Might do that when I'm back on my computer – Cariotta 4/5, 2015 at 18:2

@Anshu, I downloaded your data, read it into R and then ran x <- aggregate(. ~ country, df1[-c(38:39)], mean). It produces the correct output and doesn't include the columns "gender" or "age". There must have been a typo in the code you tried or perhaps a different data structure than in the example on github. – Cariotta 4/5, 2015 at 21:9

Ah, thanks! I got it to work in the end...thanks for all your help :D – Shulins 4/5, 2015 at 22:45

library(dplyr)

df1 %>%
    group_by(country) %>%
    select(-age, -gender) %>%
    summarise_each(funs(mean))

Pervert answered 4/5, 2015 at 17:2 Comment(1)

You could also unselect the two columns inside the summarise_each – Cariotta 4/5, 2015 at 17:37

A data.table answer:

library(data.table)

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c('age', 'gender')]

Now tidier syntax with deselection in .SDcols, thanks to user Arun

To explain what's happening here:

setDT(df1) make the data.frame a data.table
lapply(.SD, mean) for each column in the subset of data, take the mean
by = county do this by groups split according to country
.SDcols = -c('age', 'gender') omit age and gender columns from the subset of data

Weinberger answered 4/5, 2015 at 17:10 Comment(6)

I kind of don't like the readability of this vs some of the other answers. Any other data.table people want to suggest an improvement? – Weinberger 4/5, 2015 at 17:20

Instead of the setdiff thing, couldn't you just use .SDcols? – Metrify 4/5, 2015 at 17:23

Could specify all the desired cols to .SDcols, but that seems about the same as the above. Is there a tidy way to use .SDcols to omit specified columns? – Weinberger 4/5, 2015 at 17:31

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c("age", "gender")] – Inspire 4/5, 2015 at 17:34

If you insist on keeping all in list:

#split and make list of df
myList <- split(df, df$country)

#aggregate without age and gender
my_aggregate <- function(df_inlist) {
  df_inlist <- aggregate(.~country, df_inlist[ , -c(38, 39)], mean)
}

#Apply aggregate function on all data frames in the list
out <- lapply(myList, function (x) {
  my_aggregate(x)
})

out is a list of data.frames for each country and colmeans over variables. How put it all together in a data.frame :

composite_df <- do.call(rbind, out)

Cowper answered 4/5, 2015 at 17:13 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags