If I'm interpreting the question correctly, you have a large DataFrame, with one or more source data columns and then multiple columns with windowed summary statistics based on the source columns. You're trying to update the bottom of the windowed summary columns after appending new rows to your source data columns without re-calculating the entire summary column.
The way to approach this will depend on a number of things, including whether you're using centered windows or not. But hopefully this gets you started.
I'll start with a toy version of your problem, with a single source column and two windowed means:
In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df = pd.DataFrame({'source': np.arange(0, 20, 2)})

In [3]: for window in [3, 5]:
   ...:     df[f'rolling_mean_{window}'] = (
   ...:         df.source.rolling(window, center=True).mean())
   ...:
Then we append a new row to the bottom (with pd.concat, since DataFrame.append was removed in pandas 2.0):

In [4]: df = pd.concat([df, pd.DataFrame({'source': [100.0]})], ignore_index=True)

In [5]: df
Out[5]:
    source  rolling_mean_3  rolling_mean_5
0      0.0             NaN             NaN
1      2.0             2.0             NaN
2      4.0             4.0             4.0
3      6.0             6.0             6.0
4      8.0             8.0             8.0
5     10.0            10.0            10.0
6     12.0            12.0            12.0
7     14.0            14.0            14.0
8     16.0            16.0             NaN
9     18.0             NaN             NaN
10   100.0             NaN             NaN
The amount of data we have to update depends on the length of the window. For example, to bring rolling_mean_3 up to date after adding one row, we need to update the last two rows, which depend only on the last three rows of source. To be safe, we can re-calculate over the last 2*window rows plus the number of rows you added (here with window = 3):

In [6]: window = 3
   ...: df.source.iloc[-(2*window+1):].rolling(window, center=True).mean()
Out[6]:
4           NaN
5     10.000000
6     12.000000
7     14.000000
8     16.000000
9     44.666667
10          NaN
Name: source, dtype: float64
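This has the correct data for rows 5-10. Note that row 4 is not correct in this version (it's now NaN, because the slice starts there and the centered window has no left neighbor), so we only take the last window+1 rows of this result and write them into the last window+1 rows of the summary column. Here's the full solution:
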
In [7]: updated_rows = 1

In [8]: for window in [3, 5]:
   ...:     update_column_name = f'rolling_mean_{window}'
   ...:     update_column_index = df.columns.get_loc(update_column_name)
   ...:     df.iloc[-(window+updated_rows):, update_column_index] = (
   ...:         df.source
   ...:         .iloc[-(window*2+updated_rows):]
   ...:         .rolling(window, center=True).mean()
   ...:         .iloc[-(window+updated_rows):]
   ...:     )
   ...:

In [9]: df
Out[9]:
    source  rolling_mean_3  rolling_mean_5
0      0.0             NaN             NaN
1      2.0        2.000000             NaN
2      4.0        4.000000             4.0
3      6.0        6.000000             6.0
4      8.0        8.000000             8.0
5     10.0       10.000000            10.0
6     12.0       12.000000            12.0
7     14.0       14.000000            14.0
8     16.0       16.000000            32.0
9     18.0       44.666667             NaN
10   100.0             NaN             NaN
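The tail of each summary column is now correctly computed: rolling_mean_3 has a value for row 9 and rolling_mean_5 has one for row 8, while the rows whose centered window still runs past the end of the data stay NaN.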
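If you want to convince yourself that the patched tail matches a full recalculation, here is a minimal, self-contained sketch of the same steps as a plain script. It assumes the toy data above and a single appended row; the variable names (new_rows, tail, expected) are just for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the toy frame from the example above.
df = pd.DataFrame({'source': np.arange(0, 20, 2)})
windows = [3, 5]
for window in windows:
    df[f'rolling_mean_{window}'] = df['source'].rolling(window, center=True).mean()

# Append one new source row; its summary columns start out as NaN.
new_rows = pd.DataFrame({'source': [100.0]})
df = pd.concat([df, new_rows], ignore_index=True)
updated_rows = len(new_rows)

# Patch only the tail of each summary column, using the conservative slices.
for window in windows:
    col = df.columns.get_loc(f'rolling_mean_{window}')
    tail = (
        df['source']
        .iloc[-(2 * window + updated_rows):]
        .rolling(window, center=True).mean()
        .iloc[-(window + updated_rows):]
    )
    df.iloc[-(window + updated_rows):, col] = tail.to_numpy()

# Compare against a from-scratch recomputation of each whole column.
for window in windows:
    expected = df['source'].rolling(window, center=True).mean()
    pd.testing.assert_series_equal(
        df[f'rolling_mean_{window}'], expected, check_names=False
    )
```

If the assertions pass silently, the partial update and the full recompute agree on this data.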
Technically, for a centered rolling operation, you only need to update the last floor(window/2)+updated_rows rows, drawing from the last window+updated_rows rows of the dataframe. So you could use those tighter slices to really optimize things.
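As a rough sketch of that tighter version, assuming a centered rolling mean and at least one appended row: the helper update_centered_tail and its signature are hypothetical, not something pandas provides. It modifies the frame in place.

```python
import numpy as np
import pandas as pd


def update_centered_tail(df, source, target, window, updated_rows):
    """Refresh only the tail rows of a centered rolling mean that can change.

    Hypothetical helper: the name and signature are illustrative only.
    Assumes updated_rows >= 1.
    """
    n_update = window // 2 + updated_rows  # rows whose value may have changed
    n_input = window + updated_rows        # source rows that feed those values
    tail = (
        df[source]
        .iloc[-n_input:]
        .rolling(window, center=True).mean()
        .iloc[-n_update:]
    )
    df.iloc[-n_update:, df.columns.get_loc(target)] = tail.to_numpy()
    return df


# Usage on the toy data, this time appending two rows at once.
df = pd.DataFrame({'source': np.arange(0, 20, 2)})
for window in [3, 5]:
    df[f'rolling_mean_{window}'] = df['source'].rolling(window, center=True).mean()

df = pd.concat([df, pd.DataFrame({'source': [100.0, 50.0]})], ignore_index=True)
for window in [3, 5]:
    update_centered_tail(df, 'source', f'rolling_mean_{window}', window, updated_rows=2)
```

The slices here are shorter than the 2*window version, so less of the source column is re-read on each update.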
If you're producing rolling statistics that aren't centered, the approach should be the same, but leave out center=True.
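For the default (trailing) window, existing rows never depend on later data, so only the newly added rows need values, and they draw from the last window-1+updated_rows source rows. A minimal sketch under that assumption, again with a hypothetical helper name (update_trailing_tail) and at least one appended row:

```python
import numpy as np
import pandas as pd


def update_trailing_tail(df, source, target, window, updated_rows):
    """Fill in a trailing (non-centered) rolling mean for newly appended rows.

    Hypothetical helper: the name and signature are illustrative only.
    Assumes updated_rows >= 1, since iloc[-0:] would select every row.
    """
    n_input = window - 1 + updated_rows  # source rows feeding the new values
    tail = (
        df[source]
        .iloc[-n_input:]
        .rolling(window).mean()
        .iloc[-updated_rows:]
    )
    df.iloc[-updated_rows:, df.columns.get_loc(target)] = tail.to_numpy()
    return df


# Usage: trailing means on the toy data, then two appended rows.
df = pd.DataFrame({'source': np.arange(0, 20, 2)})
for window in [3, 5]:
    df[f'rolling_mean_{window}'] = df['source'].rolling(window).mean()

df = pd.concat([df, pd.DataFrame({'source': [100.0, 50.0]})], ignore_index=True)
for window in [3, 5]:
    update_trailing_tail(df, 'source', f'rolling_mean_{window}', window, updated_rows=2)
```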