Grouped moving average in r
Asked Answered
L

3

3

I'm trying to calculate a moving average in r over a particular field BUT I need this moving average to be grouped by two or more other fields. The purpose of this new average is for predictive analysis so I need it to be trailing as well. Any variables that do not have enough values to be averaged (such as student J) would ideally give either NA or its original Score value.

I've been trying rollapply and data.table and am having no luck!

I've provided the table of data and two moving averages (AVG2 with k=2 and AVG3 with k=3) to show exactly what I'm after. The moving average is on Score and the variables to group over are school, Student and area. Please help!

   no   school  Student area    Score **AVG2**  **AVG3**
   1    I       S       A       5      NA       NA
   2    B       S       A       2      NA       NA
   3    B       S       A       7      NA       NA
   4    B       O       A       3      NA       NA
   5    B       O       B       9      NA       NA
   6    I       O       A       6      NA       NA
   7    I       O       B       3      NA       NA
   8    I       S       A       7      NA       NA
   9    I       O       A       1      NA       NA
   10   B       S       A       7      4.5      NA
   11   I       S       A       3      NA       NA
   12   I       O       A       8      3.5      NA
   13   B       S       A       3      7        5.33
   14   I       O       A       4      4.5      5
   15   B       O       A       1      NA       NA
   16   I       S       A       9      5        5
   17   B       S       A       4      5        5.67
   18   B       O       A       6      2        NA
   19   I       S       A       3      6        6.33
   20   I       O       B       8      NA       NA
   21   B       S       A       3      3.5      4.67
   22   I       O       A       4      6        4.33
   23   B       O       A       1      3.5      3.33
   24   I       S       A       9      6        5
   25   B       S       A       4      3.5      3.33
   26   B       O       A       6      3.5      2.67
   27   I       J       A       6      NA       NA

here is the code to recreate the initial table in r:

school <- c('I','B','B','B','B','I','I','I','I','B','I','I','B','I','B','I','B','B','I','I','B','I','B','I','B','B','I')
Student <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
area <- c('A','A','A','A','B','A','B','A','A','A','A','A','A','A','A','A','A','A','A','B','A','A','A','A','A','A','A')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
data.frame(school, Student, area,  Score)
Lupine answered 7/2, 2016 at 2:26 Comment(3)
Can you provide a sample R code, of how you have calculated AVG2 And AVG3?Epigraphy
First, instead of providing a picture of your data please provide your data in a format that we can easily read into R, e.g. posting the output of dput(dat) or at least head(dat) in your question. Second, please share the desired output of your grouped moving average for this data. Third, please share the code you've tried so far.Gorden
I used excel to manually calculate AVG2 and AVG3 since I've had no luck with r. The desired output is the displayed AVG2 and AVG3Lupine
T
2

Here is a rollapply solution. Note that it appears that you want the average of the prior two or three rows in the same group, i.e. excluding the data on the current row.

library(zoo)

roll <- function(x, n) { 
   if (length(x) <= n) NA 
   else rollapply(x, list(-seq(n)), mean, fill = NA)
}
transform(DF, AVG2 = ave(Score, school, Student, FUN = function(x) roll(x, 2)),
              AVG3 = ave(Score, school, Student, FUN = function(x) roll(x, 3)))

giving:

   school Student Score AVG2     AVG3
1       I       S     5   NA       NA
2       B       S     2   NA       NA
3       B       S     7   NA       NA
4       B       O     3   NA       NA
5       B       O     9   NA       NA
6       I       O     6   NA       NA
7       I       O     3   NA       NA
8       I       S     7   NA       NA
9       I       O     1  4.5       NA
10      B       S     7  4.5       NA
11      I       S     3  6.0       NA
12      I       O     8  2.0 3.333333
13      B       S     3  7.0 5.333333
14      I       O     4  4.5 4.000000
15      B       O     1  6.0       NA
16      I       S     9  5.0 5.000000
17      B       S     4  5.0 5.666667
18      B       O     6  5.0 4.333333
19      I       S     3  6.0 6.333333
20      I       O     8  6.0 4.333333
21      B       S     3  3.5 4.666667
22      I       O     4  6.0 6.666667
23      B       O     1  3.5 5.333333
24      I       S     9  6.0 5.000000
25      B       S     4  3.5 3.333333
26      B       O     6  3.5 2.666667
27      I       J     6   NA       NA

Update: Fixed roll.

Transgression answered 9/2, 2016 at 5:34 Comment(8)
How can I change this if I have another column to group by? I tried this method on a different dataset with more grouping fields and was getting the error: Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argumentLupine
Say there is a third grouping field, 'area' for the above example. So the moving average will now need to be grouped over school, Student and also area. How would the code change to account for 3 groups rather than 2?Lupine
Thanks for the response. Unfortunately when I do as you suggest, I get the following error: Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argumentLupine
I've edited the initial question to now include the extra 'area' field to be grouped over alongside new values for AVG2 and AVG3. If you could show me how you can do it I'd very much appreciate it!Lupine
Try: AVG2 = ave(Score, school, Student, area, FUN = function(x) roll(x, 2)). This is erroring for me now. n = 2 doesn't work, n = 3 does, n = 4 does, n = 5 doesn'tLupine
But your AVG2 isn't giving the same result as what I need. It has n = 3, rather than n = 2. Your AVG2 and AVG3 are exactly the sameLupine
OK. Have fixed roll. < should have been <= .Transgression
Thank you so much! You are a heroLupine
M
3

You can try solving the problem using dplyr and TTR but for student J from school I it is not possible to calculate a moving average as there's only one measurement.

AVG2 caluculated with stats:filter gives the result you wanted to have, but I also added AVG2b calculated with TTR::SMA to show a simple moving average calculation, where the current measurement is also taken into account.

library(dplyr)
library(TTR)

df <- data.frame(school, Student, Score)
df$AVG2 <- NA
df$AVG2b <- NA
df[!(df$school=="I" & df$Student=="J"),] <- df[!(df$school=="I" & df$Student=="J"),] %>% 
  group_by(school, Student) %>% 
  mutate(AVG2 = stats::filter(Score, c(0, 0.5, 0.5), sides = 1 ), AVG2b = SMA(Score, n= 2)) 

    > df
   school Student Score AVG2 AVG2b
1       I       S     5   NA    NA
2       B       S     2   NA    NA
3       B       S     7   NA   4.5
4       B       O     3   NA    NA
5       B       O     9   NA   6.0
6       I       O     6   NA    NA
7       I       O     3   NA   4.5
8       I       S     7   NA   6.0
9       I       O     1  4.5   2.0
10      B       S     7  4.5   7.0
...
Me answered 7/2, 2016 at 12:19 Comment(8)
My mistake, the manually calculated values were incorrect. I've amended them, do your results match now?Lupine
No, but you can also check yourself, or is the code not working for you? If your are using a simple moving average I think the value for the third row for AVG2 should be 4.5 and for the 5th row 6.0 and so onMe
Im not sure this gives the trailing moving average that I'm after, the first 8 rows should be NA since there hasn't been a third value for both groups until row 9 (school I and student O).Lupine
I think now I understood how you are calculating the AVG2: For the third measurement of a group you take the previous 2 and take the mean from them. Moving average takes also the current measurement into account, if you don't want this you can use stats::filter(Score, c(0, 0.5, 0.5), sides = 1) instead of the SMA function...Me
Is there any way I can do this and get r to just output student J's actual score since he only has 1 value. Maybe so that the first two values of any two group combo outputs as the actual score until the third appears and the averaging is applied? Appreciate your help so far.Lupine
What is the code to do this for a moving average other than 2? A moving window of 3 or 4 for example?Lupine
I edited my answer a little bit so that all the original information is kept in df and then you can set the value for the group which is too small to the score (subsetting and replace). For a moving average with a larger window, you should modify the filter in stats::filter accordingly, see ?stats::filter for more help.Me
Thanks for that. Do you have a more generic way to handle this when you are averaging over a group that is too small? The dataset I gave was a subset of a much large dataset that will have multiple grouping combinations that are too small. If there was a way to avoid the "Error: 'filter' is longer than time series" and just have 'NA's it would make my life so much easier!!Lupine
T
2

Here is a rollapply solution. Note that it appears that you want the average of the prior two or three rows in the same group, i.e. excluding the data on the current row.

library(zoo)

roll <- function(x, n) { 
   if (length(x) <= n) NA 
   else rollapply(x, list(-seq(n)), mean, fill = NA)
}
transform(DF, AVG2 = ave(Score, school, Student, FUN = function(x) roll(x, 2)),
              AVG3 = ave(Score, school, Student, FUN = function(x) roll(x, 3)))

giving:

   school Student Score AVG2     AVG3
1       I       S     5   NA       NA
2       B       S     2   NA       NA
3       B       S     7   NA       NA
4       B       O     3   NA       NA
5       B       O     9   NA       NA
6       I       O     6   NA       NA
7       I       O     3   NA       NA
8       I       S     7   NA       NA
9       I       O     1  4.5       NA
10      B       S     7  4.5       NA
11      I       S     3  6.0       NA
12      I       O     8  2.0 3.333333
13      B       S     3  7.0 5.333333
14      I       O     4  4.5 4.000000
15      B       O     1  6.0       NA
16      I       S     9  5.0 5.000000
17      B       S     4  5.0 5.666667
18      B       O     6  5.0 4.333333
19      I       S     3  6.0 6.333333
20      I       O     8  6.0 4.333333
21      B       S     3  3.5 4.666667
22      I       O     4  6.0 6.666667
23      B       O     1  3.5 5.333333
24      I       S     9  6.0 5.000000
25      B       S     4  3.5 3.333333
26      B       O     6  3.5 2.666667
27      I       J     6   NA       NA

Update: Fixed roll.

Transgression answered 9/2, 2016 at 5:34 Comment(8)
How can I change this if I have another column to group by? I tried this method on a different dataset with more grouping fields and was getting the error: Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argumentLupine
Say there is a third grouping field, 'area' for the above example. So the moving average will now need to be grouped over school, Student and also area. How would the code change to account for 3 groups rather than 2?Lupine
Thanks for the response. Unfortunately when I do as you suggest, I get the following error: Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argumentLupine
I've edited the initial question to now include the extra 'area' field to be grouped over alongside new values for AVG2 and AVG3. If you could show me how you can do it I'd very much appreciate it!Lupine
Try: AVG2 = ave(Score, school, Student, area, FUN = function(x) roll(x, 2)). This is erroring for me now. n = 2 doesn't work, n = 3 does, n = 4 does, n = 5 doesn'tLupine
But your AVG2 isn't giving the same result as what I need. It has n = 3, rather than n = 2. Your AVG2 and AVG3 are exactly the sameLupine
OK. Have fixed roll. < should have been <= .Transgression
Thank you so much! You are a heroLupine
C
1

Here is AVG2 calculation with data.table, which is faster compared to other approaches:

library(data.table)
dt <- data.table(df)
setkey(dt, school, Student, area)
dt[, c("start", "len") := .(ifelse(.I + 1 > .I[.N], 0, .I +1), pmax(pmin(1, .I[.N] - .I -1), 0)), by = .(school, Student, area)][
    , AVG2 := mean(dt$Score[start:(start+len)]), by = 1:nrow(dt)]
res$AVG2[res$len == 0] <- NA
Cocker answered 8/5, 2016 at 22:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.