Name columns within aggregate in R

Asked 9/3, 2013 at 9:31 Answered 12/6, 2019 at 9:42

I know I can *re*name columns after I aggregate the data:

blubb <- aggregate(dat$two ~ dat$one, ...)
colnames(blubb) <- c("One", "Two")

Nothing wrong with that. But is there a way to aggregate and name the columns in one go? Sort of like:

blubb <- aggregate( ... , cols = c("One", "Two"))

It would be escpecially nice (and typo-proof) to somehow catch the original column names and do like:

blubb <- aggregate( ... , cols = c(name_of_dat$one, name_of_dat$two."_Mean"))

Auspex answered 9/3, 2013 at 9:31 Comment(0)

You can use setNames as in:

blubb <- setNames(aggregate(dat$two ~ dat$one, ...), c("One", "Two"))

Alternatively, you can bypass the slick formula method, and use syntax like:

blubb <- aggregate(list(One = dat$one), list(Two = dat$two), ...)

Update

This update is to just help get you started on deriving a solution on your own.

If you inspect the code for stats:::aggregate.formula, you'll see the following lines towards the end:

if (is.matrix(mf[[1L]])) {
    lhs <- as.data.frame(mf[[1L]])
    names(lhs) <- as.character(m[[2L]][[2L]])[-1L]
    aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...)
}
else aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...)

If all that you want to do is append the function name to the variable that was aggregated, perhaps you can change that to something like:

if (is.matrix(mf[[1L]])) {
  lhs <- as.data.frame(mf[[1L]])
  names(lhs) <- as.character(m[[2L]][[2L]])[-1L]
  myOut <- aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...)
  colnames(myOut) <- c(names(mf[-1L]), 
                       paste(names(lhs), deparse(substitute(FUN)), sep = "."))
}
else {
  myOut <- aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...)
  colnames(myOut) <- c(names(mf[-1L]), 
                       paste(strsplit(gsub("cbind\\(|\\)|\\s", "", 
                                           names(mf[1L])), ",")[[1]],
                             deparse(substitute(FUN)), sep = "."))
} 
myOut

This basically captures the value entered for FUN by using deparse(substitute(FUN)), so you can probably modify the function to accept a custom suffix, or perhaps even a vector of suffixes. This can probably be improved a bit with some work, but I'm not going to do it!

Here is a Gist with this concept applied, creating a function named "myAgg".

Here is some sample output of just the resulting column names:

> names(myAgg(weight ~ feed, data = chickwts, mean))
[1] "feed"        "weight.mean"
> names(myAgg(breaks ~ wool + tension, data = warpbreaks, sum))
[1] "wool"       "tension"    "breaks.sum"
> names(myAgg(weight ~ feed, data = chickwts, FUN = function(x) mean(x^2)))
[1] "feed"                         "weight.function(x) mean(x^2)"

Notice that only the aggregated variable name changes. But notice also that if you use a custom function, you'll end up with a really strange column name!

Event answered 9/3, 2013 at 9:34 Comment(6)

Thank you. Does that mean that it is definitely impossible to set the column names inside the aggregate() parenthesis? Also, I edited my question, maybe you have further ideas on the second half of my question. – Auspex 9/3, 2013 at 9:39

I like the alternative :-) – Auspex 9/3, 2013 at 9:44

I'll look into your problem in a little while. In the meantime, take a look at this wrapper around aggregate, which appends the function applied to the aggregated column name. I'm not posting it within my answer because it does limit the functionality of aggregate somewhat, and was just written for a project I was working on. – Event 9/3, 2013 at 10:16

It's also possible to use the names without list(). blubb <- aggregate(One = dat$one, Two = dat$two, data=dat...) – Vomer 13/4, 2016 at 18:42

@Matt, (1) that doesn't seem to work for me, (2) how would that work for more than one grouping variable, (3) why would you need dat$ if you specify data=dat (which is an argument to the formula method for aggregate, not the method you're proposing here). If I'm missing something, do let me know. Thanks! – Event 13/4, 2016 at 18:53

Late comment, but how could I do it if I have e.g. blubb <- aggregate(one ~two, FUN = function(x) c("mean" = mean(x), "median" = median(x))? So multiple output columns – Currycomb 23/12, 2017 at 0:3

The answer to your first question is yes. You can certainly include the column names in the aggregate function. Using the names from your example above:

blubb <- aggregate(dat,list(One=dat$One,Two=dat$Two),sum)

I like the part about possibly pulling in the original column names automatically. If I figure it out I'll post it.

Arnulfo answered 18/9, 2013 at 8:49 Comment(3)

And how do you change the name of the column that results from the function sum? Right now it's just x – Kite 3/2, 2018 at 23:43

@Kite I believe you can just stick it in the 'list' portion of the function e.g. ...list(One = dat$One, Two = dat$Two, MyResults = dat$x) ... – Arnulfo 18/9, 2018 at 20:25

To follow up, I think x still remains unnamed. In list(One=dat$One, etc.) that will name the grouping variable but not x. – Kite 7/2, 2022 at 0:57

In case you prefer writing aggregates as formula the documentation shows the usage of cbind. And cbind allows you to name its arguments, which are used by aggregate.

aggregate(cbind(SLength = Sepal.Length) ~ cbind(Type = Species),
  data = iris, mean)
#  Type SLength
#1    1   5.006
#2    2   5.936
#3    3   6.588

But cbind replaces factors by their internal codes and when used on the right side of ~ it keeps the cbind call in the result. To avoid this the new names can already be given in the input data set using transform or within

aggregate(SLength ~ Type, transform(iris, SLength = Sepal.Length,
   Type = Species), mean)
#        Type SLength
#1     setosa   5.006
#2 versicolor   5.936
#3  virginica   6.588

aggregate(cbind(SLength = Sepal.Length) ~ Type, transform(iris, Type = Species),
  mean)
#        Type SLength
#1     setosa   5.006
#2 versicolor   5.936
#3  virginica   6.588

The advantage of using cbind or data.frame compared to list is that not all columns need to be given a (new) name. Aggregation of more than one column by more than one grouping factor could be done like:

aggregate(cbind("Miles/gallon" = mpg, Weight = wt, hp) ~ Cylinders + Carburetors
          + gear, transform(mtcars, Cylinders = cyl, Carburetors = carb), mean)
#   Cylinders Carburetors gear Miles/gallon  Weight    hp
#1          4           1    3        21.50 2.46500  97.0
#2          6           1    3        19.75 3.33750 107.5
#...

and if you want to use more than one function:

aggregate(cbind(cases=ncases, ncontrols) ~ alc + tobgp, transform(esoph,
  alc=alcgp), FUN = function(x) c("mean" = mean(x), "median" = median(x)))
#         alc    tobgp cases.mean cases.median ncontrols.mean ncontrols.median
#1  0-39g/day 0-9g/day  1.5000000    1.0000000      42.000000        44.000000
#2      40-79 0-9g/day  5.6666667    4.0000000      24.166667        29.000000
#...

which adds to the colname the used aggregate-function.

Some examples to aggregate mpg renamed to Miles/gallon and hp by cyl renamed to Cylinders and gear using mtcars resulting in:

#  Cylinders gear Miles/gallon       hp
#1         4    3       21.500  97.0000
#2         6    3       19.750 107.5000
#3         8    3       15.050 194.1667
#4         4    4       26.925  76.0000
#5         6    4       19.750 116.5000
#6         4    5       28.200 102.0000
#7         6    5       19.700 175.0000
#8         8    5       15.400 299.5000

aggregate(cbind("Miles/gallon" = mpg, hp) ~ Cylinders + gear,
          transform(mtcars, Cylinders = cyl, Carburetors = carb), mean)

with(mtcars, aggregate(cbind("Miles/gallon" = mpg, hp),
                       data.frame(Cylinders = cyl, gear), mean))

with(mtcars, aggregate(data.frame("Miles/gallon" = mpg, hp),
                       data.frame(Cylinders = cyl, gear), mean))

Trentontrepan answered 12/6, 2019 at 9:42 Comment(1)

This looks better than all the other solutions. Naming through cbind allows for more complex cases than list, and avoids post-hoc operations that rely on knowing the order of the output columns in advance and might become outdated if the formula is modified. – Lucianolucias 16/11, 2022 at 23:13

-4

w <- data.frame(Funding<-"Fully Insured",Region="North East",claim_count=rnbinom(1000, 300.503572818, mu= 0.5739467))
x <- data.frame(Funding<-"Fully Insured",Region="South East",claim_count=rnbinom(1000, 1000, mu= 0.70000000))
y <- data.frame(Funding<-"Self Insured",Region="North East",claim_count=rnbinom(1000, 400, mu= 0.80000000))
z <- data.frame(Funding<-"Self Insured",Region="South East",claim_count=rnbinom(1000, 700, mu= 1.70000000))
names(w)<-c("Funding","Region","claim_count")
names(x)<-c("Funding","Region","claim_count")
names(y)<-c("Funding","Region","claim_count")
names(z)<-c("Funding","Region","claim_count")
my_df <- rbind(w,x,y,z)
my_df2<-with(my_df, aggregate(x=claim_count, by=list(Funding,Region), FUN=sum))
colnames(my_df2)<-colnames(my_df)

Flutter answered 18/8, 2015 at 23:7 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Update

Recommended topics

Hot tags