pandas rolling window effective add new rows

I have a pretty big dataset (around 5 million rows) with several calculated columns: lags (1 and 7) and rolling windows (7, 30 and 90, each with several statistics such as mean, std, min and max). Now I need to add one or more rows to the df, and I'm wondering what the most efficient way to (re)calculate those features is. Recalculating the whole df would take too much time, but I cannot simply apply the functions to the newly added rows alone.

For the lags it's not a big issue. For lag 1, for example, I can simply do

df.iloc[-1, -2] = df.iloc[-2, -2]

and this should solve the problem (the same for lag 7), but what about rolling windows? The same approach won't work and I cannot afford to (re)run the rolling windows on the whole dataframe. Shall I make the calculations by hand? Any other way?

Python 3.7.7 and pandas 1.0.3

Pilate answered 30/5, 2020 at 20:35 Comment(1)
Can you post what you've tried and why it's not working? See this post on how to ask a question for some best practices. There are lots of ways to do these things, and plenty more ways for it to go wrong, so we need to know where you're starting from. Thanks! - Pestalozzi

If I'm interpreting the question correctly, you have a large DataFrame, with one or more source data columns and then multiple columns with windowed summary statistics based on the source columns. You're trying to update the bottom of the windowed summary columns after appending new rows to your source data columns without re-calculating the entire summary column.

The way to approach this will depend on a number of things, including whether you're using centered windows or not. But hopefully this gets you started.

I'll start with a toy version of your problem, with a single source column and two windowed means:

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df = pd.DataFrame({'source': np.arange(0, 20, 2)})

In [3]: for window in [3, 5]:
   ...:     df[f'rolling_mean_{window}'] = (
   ...:         df.source.rolling(window, center=True).mean())
   ...:

Then we append a new row to the bottom:

In [4]: # pd.concat is used because df.append was removed in pandas 2.0
   ...: df = pd.concat([df, pd.DataFrame({'source': [100.0]})], ignore_index=True)

In [5]: df
Out[5]:
    source  rolling_mean_3  rolling_mean_5
0      0.0             NaN             NaN
1      2.0             2.0             NaN
2      4.0             4.0             4.0
3      6.0             6.0             6.0
4      8.0             8.0             8.0
5     10.0            10.0            10.0
6     12.0            12.0            12.0
7     14.0            14.0            14.0
8     16.0            16.0             NaN
9     18.0             NaN             NaN
10   100.0             NaN             NaN

The amount of data we have to update depends on the length of the window. For example, to update rolling_mean_3 we need to update the last two rows, using information from the last three rows. To be safe, we can re-calculate over the last 2*window rows plus the number of rows you added:

In [6]: window = 3  # the loop above left window == 5
   ...: df.source.iloc[-(2*window+1):].rolling(window, center=True).mean()
Out[6]:
4           NaN
5     10.000000
6     12.000000
7     14.000000
8     16.000000
9     44.666667
10          NaN
Name: source, dtype: float64

This has the correct data for rows 5-10. Note that row 4 is not correct in this version (it's now NaN, because the slice cuts off its left neighbor), so we use this result to update only the last window+updated_rows rows. Here's the full solution:

In [7]: updated_rows = 1
In [8]: for window in [3, 5]:
   ...:     update_column_name = f'rolling_mean_{window}'
   ...:     update_column_index = df.columns.get_loc(update_column_name)
   ...:     df.iloc[-(window+updated_rows):, update_column_index] = (
   ...:         df.source
   ...:         .iloc[-(window*2+updated_rows):]
   ...:         .rolling(window, center=True).mean()
   ...:         .iloc[-(window+updated_rows):]
   ...:     )

In [9]: df
Out[9]:
    source  rolling_mean_3  rolling_mean_5
0      0.0             NaN             NaN
1      2.0        2.000000             NaN
2      4.0        4.000000             4.0
3      6.0        6.000000             6.0
4      8.0        8.000000             8.0
5     10.0       10.000000            10.0
6     12.0       12.000000            12.0
7     14.0       14.000000            14.0
8     16.0       16.000000            32.0
9     18.0       44.666667             NaN
10   100.0             NaN             NaN

The tail of each rolling column is now correctly computed.
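One way to gain confidence in the tail update is to compare it against a full recomputation on the toy data. The following is a minimal sketch of that check, not part of the original session: it rebuilds the example as a plain script, uses pd.concat for the append so that it also runs on recent pandas versions, and the names n_update and new_rows are only illustrative.

```python
import numpy as np
import pandas as pd

windows = [3, 5]

# Build the toy frame and its centered rolling means.
df = pd.DataFrame({"source": np.arange(0, 20, 2, dtype=float)})
for window in windows:
    df[f"rolling_mean_{window}"] = df["source"].rolling(window, center=True).mean()

# Append one row; its rolling columns start out as NaN.
new_rows = pd.DataFrame({"source": [100.0]})
updated_rows = len(new_rows)
df = pd.concat([df, new_rows], ignore_index=True)

# Tail-only update, using the same slice sizes as the answer.
for window in windows:
    col = df.columns.get_loc(f"rolling_mean_{window}")
    n_update = window + updated_rows  # rows to overwrite
    tail = (
        df["source"]
        .iloc[-(2 * window + updated_rows):]  # rows fed to rolling()
        .rolling(window, center=True)
        .mean()
        .iloc[-n_update:]
    )
    df.iloc[-n_update:, col] = tail.to_numpy()

# Reference: recompute each column from scratch and compare.
for window in windows:
    expected = df["source"].rolling(window, center=True).mean()
    pd.testing.assert_series_equal(
        df[f"rolling_mean_{window}"], expected, check_names=False
    )
```

On a 5-million-row frame you would run the comparison once on a small sample rather than on every update, since the full recomputation is exactly the cost being avoided.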

Technically, for a centered rolling operation, you only need to update the last floor(window/2)+updated_rows rows, drawing from the last window+updated_rows rows of the dataframe. So you could do this to really optimize things.
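A sketch of that tighter version could look like the following. The helper name update_centered_mean and the rolling_mean_{window} column naming are assumptions carried over from the toy example, and two rows are appended here to show that the slice sizes scale with updated_rows.

```python
import numpy as np
import pandas as pd


def update_centered_mean(df, window, updated_rows, source="source"):
    """Refresh only the rows of a centered rolling mean affected by an append.

    Hypothetical helper: assumes the target column is named
    f"rolling_mean_{window}" and that the new rows are already in df.
    """
    col = df.columns.get_loc(f"rolling_mean_{window}")
    n_update = window // 2 + updated_rows  # rows whose window changed
    n_context = window + updated_rows      # rows needed to recompute them
    tail = df[source].iloc[-n_context:].rolling(window, center=True).mean()
    df.iloc[-n_update:, col] = tail.iloc[-n_update:].to_numpy()


df = pd.DataFrame({"source": np.arange(0, 20, 2, dtype=float)})
for window in [3, 5]:
    df[f"rolling_mean_{window}"] = df["source"].rolling(window, center=True).mean()

# Append two rows, then refresh only the affected tail of each column.
new_rows = pd.DataFrame({"source": [100.0, 50.0]})
df = pd.concat([df, new_rows], ignore_index=True)

for window in [3, 5]:
    update_centered_mean(df, window, updated_rows=len(new_rows))
```

With window=5 and two new rows this touches 4 rows and reads 7, instead of touching 7 and reading 12 with the safer slices used earlier.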

If your rolling statistics aren't centered, the approach should be the same, but omit center=True.
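For trailing (non-centered) windows, each value depends only on the current row and the window-1 rows before it, so only the appended rows need new values. Here is a rough sketch under that assumption, computing several statistics at once with rolling(...).agg; the random data, the window of 7 and the rolling_{stat}_{window} column names are made up for illustration.

```python
import numpy as np
import pandas as pd

stats = ["mean", "std", "min", "max"]
window = 7

# Illustrative data with trailing rolling statistics already computed.
rng = np.random.default_rng(0)
df = pd.DataFrame({"source": rng.normal(size=50)})
rolled = df["source"].rolling(window).agg(stats)
for stat in stats:
    df[f"rolling_{stat}_{window}"] = rolled[stat]

# Append three rows; their rolling columns start out as NaN.
new_rows = pd.DataFrame({"source": rng.normal(size=3)})
updated_rows = len(new_rows)
df = pd.concat([df, new_rows], ignore_index=True)

# A trailing window only looks backwards, so existing rows are unchanged.
# The first new row needs the window - 1 rows before it as context.
context = df["source"].iloc[-(window - 1 + updated_rows):]
tail = context.rolling(window).agg(stats).iloc[-updated_rows:]
for stat in stats:
    col = df.columns.get_loc(f"rolling_{stat}_{window}")
    df.iloc[-updated_rows:, col] = tail[stat].to_numpy()
```

The work per update is proportional to window + updated_rows rather than to the length of the frame, which is the point of the exercise for windows of 7, 30 or 90 on millions of rows.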

Pestalozzi answered 30/5, 2020 at 21:48 Comment(1)
Thanks a lot, it seems exactly what I was looking for. I will try as soon as possible but it looks fine. Yes, the windows are centred, and my problem with a rolling window on the last columns was exactly that it would have filled the first row with NaNs, but your solution is brilliant! - Pilate

For a trailing (non-centered) window with a single appended row, you just need to compute the mean over the last window items:

updated_rows = 1
for window in [3, 5]:
    update_column_name = f'rolling_mean_{window}'
    update_column_index = df.columns.get_loc(update_column_name)
    df.iloc[-(updated_rows):, update_column_index] = df.source.iloc[-(window):].mean()
Opportune answered 3/11, 2020 at 1:57
