Python - Time-weighted average Pandas, grouped by time interval

I have a time series in a Pandas DataFrame. The timestamps can be uneven (one every 1-5 minutes), but there will always be one every 5 minutes (timestamps with minutes equal to 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 or 55).

Example:

2017-01-01 2:05:00   32.90
2017-01-01 2:07:30   29.83
2017-01-01 2:10:00   45.76
2017-01-01 2:15:00   16.22
2017-01-01 2:20:00   17.33
2017-01-01 2:25:00   23.40
2017-01-01 2:28:45   150.12
2017-01-01 2:30:00   100.29
2017-01-01 2:35:00   38.45
2017-01-01 2:40:00   67.12
2017-01-01 2:45:00   20.00
2017-01-01 2:50:00   58.41
2017-01-01 2:55:00   58.32
2017-01-01 3:00:00   59.89

I want to get the time-weighted averages of blocks of 15 minutes. The rows with a time stamp that is directly on a 15 minute mark (timestamps with minutes ending in 0,15,30,45) end an interval, so the grouping is as follows:

Group 1 (interval 2017-01-01 2:00:00):
    2017-01-01 2:05:00   32.90
    2017-01-01 2:07:30   29.83
    2017-01-01 2:10:00   45.76
    2017-01-01 2:15:00   16.22

Group 2 (interval 2017-01-01 2:15:00):
    2017-01-01 2:20:00   17.33
    2017-01-01 2:25:00   23.40
    2017-01-01 2:28:45   150.12
    2017-01-01 2:30:00   100.29

Group 3 (interval 2017-01-01 2:30:00):
    2017-01-01 2:35:00   38.45
    2017-01-01 2:40:00   67.12
    2017-01-01 2:45:00   20.00

Group 4 (interval 2017-01-01 2:45:00):
    2017-01-01 2:50:00   58.41
    2017-01-01 2:55:00   58.32
    2017-01-01 3:00:00   59.89

The average MUST be time-weighted, so not just a standard average of all values in a group.

For instance, the time-weighted average of group 2 is NOT 72.785, which is the regular average of all 4 values. Rather, it should be:

 (5 minutes / 15 minutes) * 17.33 = 5.776667     ==> The 5 minutes is taken from the difference between this timestamp and the previous timestamp
+(5 minutes / 15 minutes) * 23.40 = 7.8
+(3.75 minutes / 15 minutes) * 150.12 = 37.53
+(1.25 minutes / 15 minutes) * 100.29 = 8.3575

= **59.46417**
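The arithmetic above can be sketched in plain Python (the durations are the minutes each sample covers, taken from the timestamp differences):

```python
# Group 2: each value is weighted by the time elapsed since the previous
# timestamp (or since the 2:15 interval start for the first value).
durations = [5, 5, 3.75, 1.25]           # minutes covered by each sample
values = [17.33, 23.40, 150.12, 100.29]
twa = sum(d / 15 * v for d, v in zip(durations, values))
print(round(twa, 5))  # 59.46417
```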

Ideally, the 15-minute block size would also be parameterized, as it might change to 60 minutes (hourly) in the future, but I don't think that is an issue here.

Performance also matters: my dataset will have around 10k rows, so iterating over each record one by one would be pretty slow.

I tried looking into Pandas' df.rolling() function, but couldn't figure out how to apply it to my specific scenario.

Thanks a lot for the help!

UPDATE 1:

Following Simon's brilliant solution, I modified it a little bit.

I made a few tweaks to it to adapt it to my specific case:

def func(df):
    if df.size == 0:
        return np.nan  # empty 15-minute interval (missing rows in the DB)
    timestep = 15*60
    # seconds elapsed since the interval start; the last row of each group
    # sits exactly on the interval end
    indexes = df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))
    seconds = indexes.seconds
    weight = [seconds[n]/timestep if n == 0 else (seconds[n] - seconds[n - 1])/timestep
              for n, k in enumerate(seconds)]
    return np.sum(weight*df.values)

This is to cope with possibly empty 15 minute intervals (missing rows in the DB)
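As a self-contained sanity check of the function above (repeated here so the snippet runs on its own), resampling just the group-2 rows reproduces the expected 59.464167:

```python
import numpy as np
import pandas as pd

def func(df):
    if df.size == 0:
        return np.nan  # empty 15-minute interval
    timestep = 15 * 60
    # seconds elapsed since the interval start (last row sits on the interval end)
    indexes = df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))
    seconds = indexes.seconds
    weight = [seconds[n] / timestep if n == 0 else (seconds[n] - seconds[n - 1]) / timestep
              for n, k in enumerate(seconds)]
    return np.sum(weight * df.values)

s = pd.Series(
    [17.33, 23.40, 150.12, 100.29],
    index=pd.to_datetime(["2017-01-01 2:20:00", "2017-01-01 2:25:00",
                          "2017-01-01 2:28:45", "2017-01-01 2:30:00"]),
)
result = s.resample("15min", closed="right").apply(func)
print(result)  # one row, labeled 2017-01-01 02:15:00, value ~59.4642
```

Note that with closed='right', resample labels each bin by its left edge by default, which is why the group-2 result appears under 2017-01-01 02:15:00.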

Pedigree answered 4/9, 2017 at 3:52 Comment(0)

This one was tricky. I would love to see another commenter do this more efficiently, since I have a hunch there is a better way to do this.

I also skipped one part, which is parameterizing the 15-minute value, though I point out in a comment how you might do it. This is left as an exercise for the reader :D It should be parameterized though; as it stands there are a lot of magic '*15' and '*60' values scattered around the place, which looks clumsy.

I'm also tired, and my wife wants to watch a movie, so I didn't clean up my code. It's a little messy, and should be written cleaner -- which may or may not be worth doing, depending on if someone else can redo this all in 6 lines of code. If tomorrow morning it is still unanswered, I'll go back through and do this better.

Updated better solution 1

def func(df):
    timestep = 15*60
    # seconds elapsed into the current 15-minute bin; the "-1 ... +1" trick
    # maps a timestamp sitting exactly on the bin end to timestep instead of 0
    seconds = np.mod(df.index.minute*60 + df.index.second - 1, timestep) + 1
    weight = [k/timestep if n == 0 else (seconds[n] - seconds[n - 1])/timestep
              for n, k in enumerate(seconds)]
    return np.sum(weight*df.values)

df.resample('15min', closed='right').apply(func)
Topnotch answered 4/9, 2017 at 5:30 Comment(3)
This is brilliant! Thank you very much, this is exactly what I needed! Is there a way to use GroupBy() instead of Resample()? The reason is I have another column I want to group by, which I didn't include in the original question for simplicity's sake. I seem to be able to use df.groupby([pd.TimeGrouper(freq='15Min')]), but there doesn't seem to be a way to close the group on the right side, like the resample() function has.Pedigree
So basically, I have the following 4 columns in my table: "TIME | ZONE | PRICE1 | PRICE2" And I want to have a time-weighted average per zone and per 15 minute interval for each pricePedigree
I did some more tests with more data and the whole thing is very slow; maybe I'm just not used to Python's speed. To process 1.6 million rows (in 530k groups of ~3 rows each), it took about 10 minutes. I did the same in a C# program (the code was much longer since I had to iterate manually over each row) and it took less than 10 seconds.Pedigree
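Regarding the GroupBy question in the first comment: pd.Grouper (the modern replacement for pd.TimeGrouper) accepts the same closed/label arguments as resample(), so zone and interval can be grouped together. A sketch under the TIME | ZONE | PRICE1 layout described in the comments, with the weight loop replaced by a vectorized np.diff (which should also help with the speed issue raised in the last comment):

```python
import numpy as np
import pandas as pd

def func(s):
    if s.size == 0:
        return np.nan  # empty interval
    timestep = 15 * 60
    # seconds since the interval start; the last row sits on the interval end
    seconds = (s.index - (s.index[-1] - pd.Timedelta(seconds=timestep))).seconds
    weights = np.diff(seconds, prepend=0) / timestep
    return np.sum(weights * s.values)

df = pd.DataFrame(
    {"ZONE": ["A"] * 4, "PRICE1": [17.33, 23.40, 150.12, 100.29]},
    index=pd.to_datetime(["2017-01-01 2:20:00", "2017-01-01 2:25:00",
                          "2017-01-01 2:28:45", "2017-01-01 2:30:00"]),
)
out = df.groupby(
    ["ZONE", pd.Grouper(freq="15min", closed="right", label="left")]
)["PRICE1"].apply(func)
print(out)  # ('A', 2017-01-01 02:15:00) -> ~59.4642
```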

The accepted solution provided by simon assumes the timestamps are time-ending, and thus does not take into account the value(s) present before the opening of each 15-minute window. That is not how raw datasets are typically sampled. The correct solution for time-beginning timestamps requires inserting a sample at each 15th minute of the time series, carrying forward the last sample from the prior 15-minute window; the time-weighted calculation then requires a little more data manipulation.

For time-beginning timestamps, the value 32.90 recorded at 2017-01-01 2:05:00 was not present for 5/15 of the 15-minute window: it was sampled at that minute and held for the period from 2017-01-01 2:05:00 to 2017-01-01 2:07:30, so it has a weight of 2.5/15. We do not know what value was present for the first 5 minutes of the window. The accepted solution also fails to produce an average for windows in which no value was recorded, when in actuality the last value prior to such a window is the time-weighted average of every following window until a new value is posted to the dataset.

The correct solution requires forcing a sample at each 15th minute of the time series that carries forward the last value from the prior 15-minute window.

import numpy as np
from pandas import Series, Timestamp, concat, date_range

series = concat(
    [
        series,
        Series(index=date_range(
            Timestamp('2017-01-01 2:15:00'), 
            Timestamp('2017-01-01 3:00:00'), 
            freq='15T'
        ), dtype=np.float64)
    ]
).sort_index()
series = series[~series.index.duplicated(keep='first')].ffill()

The provided data already has samples at each 15th minute, but this should be done anyway to guarantee data integrity when resampling. The time-weighted average function for samples with time-beginning timestamps is as follows:

def fifteen_minute_weighted_average(series):
    # seconds into the 15-minute window for each sample, with the window end
    # (15 * 60) appended so np.diff yields each sample's holding time
    seconds = np.append(
        np.mod(series.index.minute.values, 15) * 60 + series.index.second.values,
        15 * 60
    )
    weights = np.diff(seconds) / (15 * 60)
    return np.sum(weights * series.values)

This solution puts the time-delta weight on the correct sample for time-beginning data. Then, when you resample the data, make sure to use closed='left' instead of closed='right' in order to capture the starting sample of each window; the sample at the window's end is useless here because it did not occur within the window. The groupings shown in the question are not how time-weighted averages require the data to be grouped for time-beginning timestamps.

Resampling to 15-minutes is as follows:

weighted_averages = series.loc[
    Timestamp('2017-01-01 2:15:00'):Timestamp('2017-01-01 3:00:00')
].resample('15T', closed='left').apply(fifteen_minute_weighted_average)
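An end-to-end sketch of this time-beginning approach on the question's data (restating the averaging function so the snippet is self-contained; every window start in this slice already has a real sample, so the forced-sample step is not needed here):

```python
import numpy as np
import pandas as pd

values = pd.Series(
    [32.9, 29.83, 45.76, 16.22, 17.33, 23.4, 150.12,
     100.29, 38.45, 67.12, 20.0, 58.41, 58.32, 59.89],
    index=pd.to_datetime([
        "2017-01-01 2:05:00", "2017-01-01 2:07:30", "2017-01-01 2:10:00",
        "2017-01-01 2:15:00", "2017-01-01 2:20:00", "2017-01-01 2:25:00",
        "2017-01-01 2:28:45", "2017-01-01 2:30:00", "2017-01-01 2:35:00",
        "2017-01-01 2:40:00", "2017-01-01 2:45:00", "2017-01-01 2:50:00",
        "2017-01-01 2:55:00", "2017-01-01 3:00:00",
    ]),
)

def fifteen_minute_weighted_average(series):
    # seconds into the window for each sample, with the window end appended
    # so np.diff gives how long each value was held
    seconds = np.append(
        np.mod(series.index.minute.values, 15) * 60 + series.index.second.values,
        15 * 60,
    )
    weights = np.diff(seconds) / (15 * 60)
    return np.sum(weights * series.values)

out = values.loc["2017-01-01 2:15:00":"2017-01-01 2:59:59"].resample(
    "15min", closed="left"
).apply(fifteen_minute_weighted_average)
print(out)
```

Unlike the accepted answer, each value here is weighted by how long it was held from its own timestamp forward, so the window [2:15, 2:30) averages 16.22, 17.33, 23.40 and 150.12 rather than the values ending at 2:30.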

Redouble answered 22/9, 2022 at 23:35 Comment(0)

Another option is to multiply the values by the fractional time between the ticks and then add the results. The following function takes a series or dataframe of values plus the requested target index:

import numpy as np
import pandas as pd


def resample_time_weighted_mean(x, target_index, closed=None, label=None):
    shift = 1 if closed == "right" else -1
    fill = "bfill" if closed == "right" else "ffill"
    # Determine length of each interval (daylight saving aware)
    extended_index = target_index.union(
        [target_index[0] - target_index.freq, target_index[-1] + target_index.freq]
    )
    interval_lengths = -extended_index.to_series().diff(periods=shift)

    # Create a combined index of the source index and target index and reindex to combined index
    combined_index = x.index.union(extended_index)
    x = x.reindex(index=combined_index, method=fill)
    interval_lengths = interval_lengths.reindex(index=combined_index, method=fill)

    # Determine weights of each value and multiply source values
    weights = -x.index.to_series().diff(periods=shift) / interval_lengths
    x = x.mul(weights, axis=0)

    # Resample to new index, the final reindex is necessary because resample 
    # might return more rows based on the frequency
    return (
        x.resample(target_index.freq, closed=closed, label=label)
        .sum()
        .reindex(target_index)
    )

Applying this to the sample data:

x = pd.Series(
    [
        32.9,
        29.83,
        45.76,
        16.22,
        17.33,
        23.4,
        150.12,
        100.29,
        38.45,
        67.12,
        20.0,
        58.41,
        58.32,
        59.89,
    ],
    index=pd.to_datetime(
        [
            "2017-01-01 2:05:00",
            "2017-01-01 2:07:30",
            "2017-01-01 2:10:00",
            "2017-01-01 2:15:00",
            "2017-01-01 2:20:00",
            "2017-01-01 2:25:00",
            "2017-01-01 2:28:45",
            "2017-01-01 2:30:00",
            "2017-01-01 2:35:00",
            "2017-01-01 2:40:00",
            "2017-01-01 2:45:00",
            "2017-01-01 2:50:00",
            "2017-01-01 2:55:00",
            "2017-01-01 3:00:00",
        ]
    ),
)

opts = dict(closed="right", label="right")
resample_time_weighted_mean(
    x, pd.DatetimeIndex(x.resample("15T", **opts).groups.keys(), freq="infer"), **opts
)

Which returns:

2017-01-01 02:15:00    18.005000
2017-01-01 02:30:00    59.464167
2017-01-01 02:45:00    41.856667
2017-01-01 03:00:00    58.873333
Freq: 15T, dtype: float64

Regarding the performance issues mentioned under simon's answer, this method performs well on millions of rows too, since the weights are calculated in one vectorized step rather than in a relatively slow Python loop:

new_index = pd.date_range("2017-01-01", "2021-01-01", freq="1T")
new_index = new_index + pd.TimedeltaIndex(
    np.random.rand(*new_index.shape) * 60 - 30, "s"
)
values = pd.Series(np.random.rand(*new_index.shape), index=new_index)
print(values.shape)
(2103841,)


%%timeit
resample_time_weighted_mean(
    values, pd.date_range("2017-01-01", "2021-01-01", freq="15T"), closed="right"
)
4.93 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Potboiler answered 1/4, 2021 at 8:6 Comment(1)
This method is way faster than the answers by Draco, Simon. 0.16s vs 16.85s on my particular data (2,388,336 rows, 2 columns)—better than 100x speedup! And I suspect that will only improve as number of columns increases since .apply() loops over every column individually. Thanks for the nice generalized function too (works for both closed and label types, as well as both series and dataframes). Great answer!Cardona

Let the label of the first (timestamp) column be ts and that of the value column be value.

import pandas as pd
from datetime import timedelta

def tws(df, length):
    # assumes df has a default RangeIndex and columns 'ts' and 'value'
    df['ts'] = pd.to_datetime(df['ts'])
    interval = [0]
    df1 = df
    # minutes elapsed since the previous row
    for i in range(1, len(df1)):
        delta = df1.loc[i, 'ts'] - df1.loc[i - 1, 'ts']
        interval.append((delta.days * 24 * 60 + delta.seconds) / 60)
    df1['time_interval'] = interval

    start = pd.to_datetime('2017-01-01 2:00:00')
    TWS = []
    ave = 0
    for i in range(1, len(df1) + 1):
        try:
            if df1.loc[i, 'ts'] <= (start + timedelta(minutes=length)):
                ave = ave + df1.loc[i, 'value'] * df1.loc[i, 'time_interval']
            else:
                TWS.append(ave / length)
                ave = df1.loc[i, 'value'] * df1.loc[i, 'time_interval']
                start = df1.loc[i - 1, 'ts']
        except KeyError:
            # past the last row: flush the final interval
            TWS.append(ave / length)

    return TWS

tws(df,15)

The output is a list containing the time-weighted average of each interval.

Gert answered 5/11, 2019 at 18:36 Comment(0)

I tried all of these answers and they all ended up breaking in certain situations.

This worked for me:

import pandas as pd

def resample_timeweighted(ts, freq="15T"):
    new_index = pd.date_range(start=ts.index[0], end=ts.index[-1], freq=freq)
    last_value = 0
    timeweighted_ts = pd.Series(dtype=float)
    for t0, t1 in zip(new_index, new_index[1:]):
        # samples inside the window, padded with the carried-forward value
        # at t0 and a repeat of the last sample at t1
        s = ts.truncate(before=t0, after=t1)
        s = pd.concat([pd.Series(index=[t0], data=[last_value]), s])
        s = pd.concat([s, pd.Series(index=[t1], data=[s.iloc[-1]])])
        last_value = s.iloc[-1]
        delta_time = s.index[1:] - s.index[:-1]
        tw_mean = s.iloc[:-1].mul(delta_time.total_seconds()).sum() / (t1 - t0).total_seconds()
        timeweighted_ts = pd.concat([timeweighted_ts, pd.Series(index=[t0], data=[tw_mean])])
    return timeweighted_ts

Given your input, it returns (note that the windows start at the series' first timestamp, 2:05, rather than on the quarter-hour marks):

2017-01-01 02:05:00    31.115000
2017-01-01 02:20:00    57.566667
2017-01-01 02:35:00    41.856667
dtype: float64
Impanel answered 4/11, 2022 at 17:19 Comment(0)
