Aggregate and Weighted Mean in R

Asked 29/7, 2010 at 21:43 Answered 3/9, 2020 at 8:11

I'm trying to calculate asset-weighted returns by asset class. For the life of me, I can't figure out how to do it using the aggregate command.

My data frame looks like this

dat <- data.frame(company, fundname, assetclass, return, assets)

I'm trying to do something like (don't copy this, it's wrong):

aggregate(dat, list(dat$assetclass), weighted.mean, w=(dat$return, dat$assets))

Mould answered 29/7, 2010 at 21:43 Comment(0)

For starters, w=(dat$return, dat$assets)) is a syntax error.

And plyr makes this a little easier:

> set.seed(42)   # fix seed so that you get the same results
> dat <- data.frame(assetclass=sample(LETTERS[1:5], 20, replace=TRUE), 
+                   return=rnorm(20), assets=1e7+1e7*runif(20))
> library(plyr)
> ddply(dat, .(assetclass),   # so by asset class invoke following function
+       function(x) data.frame(wret=weighted.mean(x$return, x$assets)))
  assetclass     wret
1          A -2.27292
2          B -0.19969
3          C  0.46448
4          D -0.71354
5          E  0.55354
>

Rus answered 29/7, 2010 at 22:0 Comment(3)

It works like a charm. The first time I tried it, I replaced x in the function with dat (returning the same number for each asset class). Any idea why this won't work with the aggregate command? – Mould 30/7, 2010 at 5:1

It seems aggregate aggregates every column wheres you desire computation over two columns. I think a while ago I use doBy or something like it -- but hey, plyr makes it easier and has other bells and whistles. – Rus 30/7, 2010 at 11:38

I knew this would happen ;-) Thanks for waving the cluebat. – Rus 30/7, 2010 at 17:34

A data.table solution, will be faster than plyr

library(data.table)
DT <- data.table(dat)
DT[,list(wret = weighted.mean(return,assets)),by=assetclass]
##    assetclass        wret
## 1:          A -0.05445455
## 2:          E -0.56614312
## 3:          D -0.43007547
## 4:          B  0.69799701
## 5:          C  0.08850954

Easiness answered 10/9, 2012 at 6:50 Comment(1)

I've been meaning to look at data.table for some time now. With growing databases I guess it's about time. Appreciate the pointer! – Mould 10/9, 2012 at 18:12

This is also easily done with aggregate. It helps to remember alternate equations for a weighted mean.

rw <- dat$return * dat$assets
dat1 <- aggregate(rw ~ assetclass, data = dat, sum)
datw <- aggregate(assets ~ assetclass, data = dat, sum)
dat1$weighted.return <- dat1$rw / datw$assets

Dray answered 30/3, 2015 at 22:19 Comment(4)

wait a second, this doesn't make sense. You're essentially dividing the mean by the mean number of samples. Weighted mean is the sum of weighted values divided by the sum of weights – Decumbent 25/1, 2016 at 21:39

There's a variety of ways to do weights and in this case it is the number of samples so it's correct. There is more than one equation for weighting. Run it on Dirk's data. – Dray 25/1, 2016 at 21:53

The use of mean() confused me here. It could be sum(); using mean() implies that we're dividing the numerator and denominator by nlevels(assetclass) (for very large datasets the mean might be numerically more stable though) – Geest 30/11, 2017 at 1:15

Yes, for the two aggregates this could be rewritten using sum. It comes out the same because the ratio of two sums over equal lengths equals the ratio of two means (obviously the denominators in the mean are constant). Looking back I'm not even sure why I put mean in there so I've changed it to sums. I might have been looking through a book of weighted mean equations and just selected an alternative that works. – Dray 30/11, 2017 at 14:51

The recently released collapse package provides a fast solution to this and similar problems (using weighted median, mode etc.) by providing a full set of Fast Statistical Functions performing grouped and weighted computations internally in C++:

library(collapse)
dat <- data.frame(assetclass = sample(LETTERS[1:5], 20, replace = TRUE), 
                  return = rnorm(20), assets = 1e7+1e7*runif(20))

# Using collap() function with fmean, which supports weights: (by default weights are aggregated using the sum, which is prevented using keep.w = FALSE)
collap(dat, return ~ assetclass, fmean, w = ~ assets, keep.w = FALSE)
##   assetclass     return
## 1          A -0.4667822
## 2          B  0.5417719
## 3          C -0.8810705
## 4          D  0.6301396
## 5          E  0.3101673

# Can also use a dplyr-like workflow: (use keep.w = FALSE to omit sum.assets)
library(magrittr)
dat %>% fgroup_by(assetclass) %>% fmean(assets)
##   assetclass sum.assets     return
## 1          A   80683025 -0.4667822
## 2          B   27411156  0.5417719
## 3          C   22627377 -0.8810705
## 4          D  146355734  0.6301396
## 5          E   25463042  0.3101673

# Or simply a direct computation yielding a vector:
dat %$% fmean(return, assetclass, assets)
##          A          B          C          D          E 
## -0.4667822  0.5417719 -0.8810705  0.6301396  0.3101673

Upshaw answered 3/9, 2020 at 8:11 Comment(0)

Recommended topics

Hot tags