I have a time series in a pandas DataFrame. The timestamps can be unevenly spaced (one every 1 to 5 minutes), but there will always be one on every 5-minute mark (timestamps whose minute component is 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 55).
Example:
2017-01-01 2:05:00 32.90
2017-01-01 2:07:30 29.83
2017-01-01 2:10:00 45.76
2017-01-01 2:15:00 16.22
2017-01-01 2:20:00 17.33
2017-01-01 2:25:00 23.40
2017-01-01 2:28:45 150.12
2017-01-01 2:30:00 100.29
2017-01-01 2:35:00 38.45
2017-01-01 2:40:00 67.12
2017-01-01 2:45:00 20.00
2017-01-01 2:50:00 58.41
2017-01-01 2:55:00 58.32
2017-01-01 3:00:00 59.89
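For reference, this sample can be built as a pandas Series (a minimal sketch; the name `s` is illustrative and reused in the snippets below):

    import pandas as pd

    # The example data above, as a Series indexed by timestamp.
    s = pd.Series(
        [32.90, 29.83, 45.76, 16.22, 17.33, 23.40, 150.12,
         100.29, 38.45, 67.12, 20.00, 58.41, 58.32, 59.89],
        index=pd.to_datetime([
            "2017-01-01 02:05:00", "2017-01-01 02:07:30",
            "2017-01-01 02:10:00", "2017-01-01 02:15:00",
            "2017-01-01 02:20:00", "2017-01-01 02:25:00",
            "2017-01-01 02:28:45", "2017-01-01 02:30:00",
            "2017-01-01 02:35:00", "2017-01-01 02:40:00",
            "2017-01-01 02:45:00", "2017-01-01 02:50:00",
            "2017-01-01 02:55:00", "2017-01-01 03:00:00",
        ]),
    )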
I want to get the time-weighted average of each block of 15 minutes. Rows whose timestamp falls exactly on a 15-minute mark (minute component of 0, 15, 30, or 45) end an interval, so the grouping is as follows (see the resample sketch after the list):
Group 1 (interval 2017-01-01 2:00:00):
2017-01-01 2:05:00 32.90
2017-01-01 2:07:30 29.83
2017-01-01 2:10:00 45.76
2017-01-01 2:15:00 16.22
Group 2 (interval 2017-01-01 2:15:00):
2017-01-01 2:20:00 17.33
2017-01-01 2:25:00 23.40
2017-01-01 2:28:45 150.12
2017-01-01 2:30:00 100.29
Group 3 (interval 2017-01-01 2:30:00):
2017-01-01 2:35:00 38.45
2017-01-01 2:40:00 67.12
2017-01-01 2:45:00 20.00
Group 4 (interval 2017-01-01 2:45:00):
2017-01-01 2:50:00 58.41
2017-01-01 2:55:00 58.32
2017-01-01 3:00:00 59.89
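This grouping can be reproduced with pandas' resample by using right-closed bins labeled with their left edge (a sketch, assuming the Series `s` built above):

    # Right-closed 15-minute bins labeled by interval start: a row that
    # falls exactly on a 15-minute mark closes its interval.
    for start, group in s.resample("15min", closed="right", label="left"):
        print(start, group.to_dict())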
The average MUST be time-weighted, so not just a standard average of all values in a group.
For instance, the time-weighted average of Group 2 is NOT 72.785, which is the plain average of all 4 values. Rather, it should be:
(5 minutes / 15 minutes) * 17.33 = 5.776667 ==> the 5 minutes is the difference between this timestamp (2:20:00) and the previous timestamp (2:15:00, the start of the interval)
+(5 minutes / 15 minutes) * 23.40 = 7.8
+(3.75 minutes / 15 minutes) * 150.12 = 37.53
+(1.25 minutes / 15 minutes) * 100.29 = 8.3575
= **59.46417**
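As a quick sanity check of that arithmetic (with the time gaps hard-coded from the breakdown above):

    import numpy as np

    values = np.array([17.33, 23.40, 150.12, 100.29])
    gaps = np.array([5.0, 5.0, 3.75, 1.25])  # minutes since the previous timestamp
    print(np.sum(values * gaps / 15))        # 59.4641666...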
Ideally, the 15-minute interval length is also parameterized, as it might change to 60 minutes (hourly) in the future, but I don't think this is an issue here.
Performance is also quite important: my dataset will have around 10k rows, so iterating over each record one by one will be pretty slow.
I tried looking into pandas' df.rolling() function, but couldn't figure out how to apply it to my specific scenario.
Thanks a lot for the help!
UPDATE 1:
Following Simon's brilliant solution, I made a few tweaks to adapt it to my specific case:
    import numpy as np
    import pandas as pd

    def func(df):
        # Skip empty 15-minute intervals (missing rows in the DB).
        if df.size == 0:
            return
        timestep = 15 * 60  # interval length in seconds
        # Offsets from the interval start (the last row sits on the boundary).
        indexes = df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))
        seconds = indexes.seconds
        # Weight = time since the previous row (or interval start), as a fraction.
        weight = [seconds[n] / timestep if n == 0
                  else (seconds[n] - seconds[n - 1]) / timestep
                  for n in range(len(seconds))]
        return np.sum(weight * df.values)
The early return is there to cope with possibly empty 15-minute intervals (missing rows in the DB).
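For completeness, this is how the function can be applied per interval (a sketch; it relies on every interval boundary having a row, as stated above, so `df.index[-1]` of a non-empty group is the interval end):

    # Aggregate each right-closed 15-minute bin with the weighted average;
    # empty bins hit the early return and show up as NaN in the result.
    result = s.resample("15min", closed="right", label="left").apply(func)
    print(result)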