Name columns within aggregate in R
Asked Answered
A

4

74

I know I can *re*name columns after I aggregate the data:

blubb <- aggregate(dat$two ~ dat$one, ...)
colnames(blubb) <- c("One", "Two")

Nothing wrong with that. But is there a way to aggregate and name the columns in one go? Sort of like:

blubb <- aggregate( ... , cols = c("One", "Two"))

It would be escpecially nice (and typo-proof) to somehow catch the original column names and do like:

blubb <- aggregate( ... , cols = c(name_of_dat$one, name_of_dat$two."_Mean"))
Auspex answered 9/3, 2013 at 9:31 Comment(0)
E
93

You can use setNames as in:

blubb <- setNames(aggregate(dat$two ~ dat$one, ...), c("One", "Two"))

Alternatively, you can bypass the slick formula method, and use syntax like:

blubb <- aggregate(list(One = dat$one), list(Two = dat$two), ...)

Update

This update is to just help get you started on deriving a solution on your own.

If you inspect the code for stats:::aggregate.formula, you'll see the following lines towards the end:

if (is.matrix(mf[[1L]])) {
    lhs <- as.data.frame(mf[[1L]])
    names(lhs) <- as.character(m[[2L]][[2L]])[-1L]
    aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...)
}
else aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...)

If all that you want to do is append the function name to the variable that was aggregated, perhaps you can change that to something like:

if (is.matrix(mf[[1L]])) {
  lhs <- as.data.frame(mf[[1L]])
  names(lhs) <- as.character(m[[2L]][[2L]])[-1L]
  myOut <- aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...)
  colnames(myOut) <- c(names(mf[-1L]), 
                       paste(names(lhs), deparse(substitute(FUN)), sep = "."))
}
else {
  myOut <- aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...)
  colnames(myOut) <- c(names(mf[-1L]), 
                       paste(strsplit(gsub("cbind\\(|\\)|\\s", "", 
                                           names(mf[1L])), ",")[[1]],
                             deparse(substitute(FUN)), sep = "."))
} 
myOut

This basically captures the value entered for FUN by using deparse(substitute(FUN)), so you can probably modify the function to accept a custom suffix, or perhaps even a vector of suffixes. This can probably be improved a bit with some work, but I'm not going to do it!

Here is a Gist with this concept applied, creating a function named "myAgg".

Here is some sample output of just the resulting column names:

> names(myAgg(weight ~ feed, data = chickwts, mean))
[1] "feed"        "weight.mean"
> names(myAgg(breaks ~ wool + tension, data = warpbreaks, sum))
[1] "wool"       "tension"    "breaks.sum"
> names(myAgg(weight ~ feed, data = chickwts, FUN = function(x) mean(x^2)))
[1] "feed"                         "weight.function(x) mean(x^2)"

Notice that only the aggregated variable name changes. But notice also that if you use a custom function, you'll end up with a really strange column name!

Event answered 9/3, 2013 at 9:34 Comment(6)
Thank you. Does that mean that it is definitely impossible to set the column names inside the aggregate() parenthesis? Also, I edited my question, maybe you have further ideas on the second half of my question.Auspex
I like the alternative :-)Auspex
I'll look into your problem in a little while. In the meantime, take a look at this wrapper around aggregate, which appends the function applied to the aggregated column name. I'm not posting it within my answer because it does limit the functionality of aggregate somewhat, and was just written for a project I was working on.Event
It's also possible to use the names without list(). blubb <- aggregate(One = dat$one, Two = dat$two, data=dat...)Vomer
@Matt, (1) that doesn't seem to work for me, (2) how would that work for more than one grouping variable, (3) why would you need dat$ if you specify data=dat (which is an argument to the formula method for aggregate, not the method you're proposing here). If I'm missing something, do let me know. Thanks!Event
Late comment, but how could I do it if I have e.g. blubb <- aggregate(one ~two, FUN = function(x) c("mean" = mean(x), "median" = median(x))? So multiple output columnsCurrycomb
A
13

The answer to your first question is yes. You can certainly include the column names in the aggregate function. Using the names from your example above:

blubb <- aggregate(dat,list(One=dat$One,Two=dat$Two),sum)

I like the part about possibly pulling in the original column names automatically. If I figure it out I'll post it.

Arnulfo answered 18/9, 2013 at 8:49 Comment(3)
And how do you change the name of the column that results from the function sum? Right now it's just xKite
@Kite I believe you can just stick it in the 'list' portion of the function e.g. ...list(One = dat$One, Two = dat$Two, MyResults = dat$x) ...Arnulfo
To follow up, I think x still remains unnamed. In list(One=dat$One, etc.) that will name the grouping variable but not x.Kite
T
11

In case you prefer writing aggregates as formula the documentation shows the usage of cbind. And cbind allows you to name its arguments, which are used by aggregate.

aggregate(cbind(SLength = Sepal.Length) ~ cbind(Type = Species),
  data = iris, mean)
#  Type SLength
#1    1   5.006
#2    2   5.936
#3    3   6.588

But cbind replaces factors by their internal codes and when used on the right side of ~ it keeps the cbind call in the result. To avoid this the new names can already be given in the input data set using transform or within

aggregate(SLength ~ Type, transform(iris, SLength = Sepal.Length,
   Type = Species), mean)
#        Type SLength
#1     setosa   5.006
#2 versicolor   5.936
#3  virginica   6.588

or

aggregate(cbind(SLength = Sepal.Length) ~ Type, transform(iris, Type = Species),
  mean)
#        Type SLength
#1     setosa   5.006
#2 versicolor   5.936
#3  virginica   6.588

The advantage of using cbind or data.frame compared to list is that not all columns need to be given a (new) name. Aggregation of more than one column by more than one grouping factor could be done like:

aggregate(cbind("Miles/gallon" = mpg, Weight = wt, hp) ~ Cylinders + Carburetors
          + gear, transform(mtcars, Cylinders = cyl, Carburetors = carb), mean)
#   Cylinders Carburetors gear Miles/gallon  Weight    hp
#1          4           1    3        21.50 2.46500  97.0
#2          6           1    3        19.75 3.33750 107.5
#...

and if you want to use more than one function:

aggregate(cbind(cases=ncases, ncontrols) ~ alc + tobgp, transform(esoph,
  alc=alcgp), FUN = function(x) c("mean" = mean(x), "median" = median(x)))
#         alc    tobgp cases.mean cases.median ncontrols.mean ncontrols.median
#1  0-39g/day 0-9g/day  1.5000000    1.0000000      42.000000        44.000000
#2      40-79 0-9g/day  5.6666667    4.0000000      24.166667        29.000000
#...

which adds to the colname the used aggregate-function.

Some examples to aggregate mpg renamed to Miles/gallon and hp by cyl renamed to Cylinders and gear using mtcars resulting in:

#  Cylinders gear Miles/gallon       hp
#1         4    3       21.500  97.0000
#2         6    3       19.750 107.5000
#3         8    3       15.050 194.1667
#4         4    4       26.925  76.0000
#5         6    4       19.750 116.5000
#6         4    5       28.200 102.0000
#7         6    5       19.700 175.0000
#8         8    5       15.400 299.5000
aggregate(cbind("Miles/gallon" = mpg, hp) ~ Cylinders + gear,
          transform(mtcars, Cylinders = cyl, Carburetors = carb), mean)

with(mtcars, aggregate(cbind("Miles/gallon" = mpg, hp),
                       data.frame(Cylinders = cyl, gear), mean))

with(mtcars, aggregate(data.frame("Miles/gallon" = mpg, hp),
                       data.frame(Cylinders = cyl, gear), mean))
Trentontrepan answered 12/6, 2019 at 9:42 Comment(1)
This looks better than all the other solutions. Naming through cbind allows for more complex cases than list, and avoids post-hoc operations that rely on knowing the order of the output columns in advance and might become outdated if the formula is modified.Lucianolucias
F
-4
w <- data.frame(Funding<-"Fully Insured",Region="North East",claim_count=rnbinom(1000, 300.503572818, mu= 0.5739467))
x <- data.frame(Funding<-"Fully Insured",Region="South East",claim_count=rnbinom(1000, 1000, mu= 0.70000000))
y <- data.frame(Funding<-"Self Insured",Region="North East",claim_count=rnbinom(1000, 400, mu= 0.80000000))
z <- data.frame(Funding<-"Self Insured",Region="South East",claim_count=rnbinom(1000, 700, mu= 1.70000000))
names(w)<-c("Funding","Region","claim_count")
names(x)<-c("Funding","Region","claim_count")
names(y)<-c("Funding","Region","claim_count")
names(z)<-c("Funding","Region","claim_count")
my_df <- rbind(w,x,y,z)
my_df2<-with(my_df, aggregate(x=claim_count, by=list(Funding,Region), FUN=sum))
colnames(my_df2)<-colnames(my_df)
Flutter answered 18/8, 2015 at 23:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.