Optimized linear regression over multiple timeseries and expanding time window in a Pandas dataframe

I have a dataframe holding multiple timeseries. I want to fit a line to each one and return the slope and intercept, and I want to do it in an expanding way: first fit on the data up to the first time point, then up to the second, ..., all the way to the last.

This is what the dataframe looks like:

>>> df
                  S1        S2        S3
Date
2019-01-02  0.019552  0.021958  0.001586
2019-01-03  0.000000  0.020325  0.000000
2019-01-04  0.007540  0.023239  0.001586
2019-01-07  0.007124  0.026128  0.002945
2019-01-08  0.010660  0.026203  0.002266
...              ...       ...       ...
2024-06-28  0.918257  0.467164  0.909070
2024-07-01  0.950133  0.497262  0.914810
2024-07-02  0.968436  0.551025  0.902500
2024-07-03  0.975092  0.589036  0.944868
2024-07-05  1.000000  0.601924  0.926365

As far as I understand, Pandas expanding().apply() can only return a single scalar per window (ref; or perhaps I misunderstand how to use method='table'), so I do:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

minimum_data_samples = 10

def fit_linear_regression(signal_column):
  # too few points: return NaN for both slope and intercept
  if signal_column.empty or len(signal_column) < minimum_data_samples:
    return np.nan, np.nan

  # X is the integer position within the window: 0, 1, ..., n-1
  X = np.arange(len(signal_column)).reshape(-1, 1)
  reg = LinearRegression(n_jobs=-1).fit(X, signal_column)
  # coef_ is a length-1 array for a single feature; return it as a scalar
  return reg.coef_[0], reg.intercept_

# fit once and return slopes
slopes = df.expanding().apply(lambda x: fit_linear_regression(x)[0])

# fit a second time and return intercepts
intercepts = df.expanding().apply(lambda x: fit_linear_regression(x)[1])

# concatenate the two dataframes
slopes_and_intercepts = pd.concat([intercepts, slopes], keys=['intercepts', 'slopes'], axis=1)

The result looks like this (illustrative only; the numbers do not correspond to the dataframe above):

           intercepts                        slopes
                   S1        S2        S3        S1        S2        S3
Date
2014-01-02        NaN       NaN       NaN       NaN       NaN       NaN
2014-01-03        NaN       NaN       NaN       NaN       NaN       NaN
2014-01-06        NaN       NaN       NaN       NaN       NaN       NaN
2014-01-07        NaN       NaN       NaN       NaN       NaN       NaN
2014-01-08        NaN       NaN       NaN       NaN       NaN       NaN
...              ...       ...       ...       ...       ...       ...
2018-06-27   0.105235 -0.117916  0.063601  0.000141  0.001361  0.000162
2018-06-28   0.105115 -0.120374  0.063536  0.000142  0.001360  0.000162
2018-06-29   0.105001 -0.122802  0.063457  0.000142  0.001360  0.000162
2018-07-02   0.104905 -0.125193  0.063283  0.000143  0.001359  0.000162
2018-07-03   0.104788 -0.127579  0.062910  0.000143  0.001359  0.000163

This works, but fitting the same model twice, once to get the slopes and once to get the intercepts, feels like a waste of compute and time. To avoid the second fit I considered memoizing fit_linear_regression at the expense of memory (or disk), but it still feels like there should be a better way.

Is there a way to accomplish the above with a single call to fit_linear_regression per window?
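
One possible way to get both values from a single fit per window is to skip expanding().apply() and loop over the window end points directly. The sketch below assumes the columns contain no NaN values and swaps LinearRegression for np.polyfit, which accepts a 2-D y and so fits every column in one call. The names expanding_fit and min_samples are illustrative, not part of the code above.

```python
import numpy as np
import pandas as pd

def expanding_fit(df, min_samples=10):
    """Fit y = slope * x + intercept on every expanding window, one fit per window."""
    values = df.to_numpy(dtype=float)          # assumption: no NaN values
    n_rows, n_cols = values.shape
    x = np.arange(n_rows)                      # X is the row position, as in the question
    slopes = np.full((n_rows, n_cols), np.nan)
    intercepts = np.full((n_rows, n_cols), np.nan)

    # a line needs at least two points, whatever min_samples says
    for end in range(max(min_samples, 2), n_rows + 1):
        # with a 2-D y, polyfit returns an array of shape (2, n_cols):
        # row 0 holds the slopes, row 1 the intercepts
        coef = np.polyfit(x[:end], values[:end], deg=1)
        slopes[end - 1] = coef[0]
        intercepts[end - 1] = coef[1]

    return pd.concat(
        {
            "intercepts": pd.DataFrame(intercepts, index=df.index, columns=df.columns),
            "slopes": pd.DataFrame(slopes, index=df.index, columns=df.columns),
        },
        axis=1,
    )
```

Calling expanding_fit(df) would return a dataframe with the same two-level column layout as the output shown earlier. It still performs one least-squares fit per row, so the cost grows quadratically with the length of the series.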

The situation is similar to the one in "Pandas' expanding with apply function on multiple columns", but differs in three points:

  1. here, X is derived from the index
  2. each column in the input is a separate timeseries that we need to fit
  3. the output doesn't have to be added to the input dataframe and can be a new dataframe
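
Because X is just 0, 1, ..., n-1, these three points also seem to allow a fully vectorized version. For a window of n points the least-squares line has slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and intercept = (Sy - slope*Sx) / n, where Sx, Sy, Sxx and Sxy are sums over the window, and for expanding windows those sums are cumulative sums. A minimal sketch, assuming no NaN values in the data; expanding_fit_closed_form and min_samples are hypothetical names:

```python
import numpy as np
import pandas as pd

def expanding_fit_closed_form(df, min_samples=10):
    """Expanding-window slope and intercept from cumulative sums (no per-window fit)."""
    y = df.to_numpy(dtype=float)                       # assumption: no NaN values
    n_rows = len(df)
    x = np.arange(n_rows, dtype=float)                 # row position used as X
    n = np.arange(1, n_rows + 1, dtype=float)[:, None] # size of each expanding window

    sum_x = np.cumsum(x)[:, None]
    sum_xx = np.cumsum(x * x)[:, None]
    sum_y = np.cumsum(y, axis=0)
    sum_xy = np.cumsum(x[:, None] * y, axis=0)

    denom = n * sum_xx - sum_x ** 2                    # zero for the one-point window
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = (n * sum_xy - sum_x * sum_y) / denom
        intercept = (sum_y - slope * sum_x) / n

    # blank out windows that are shorter than the required minimum
    too_short = np.arange(1, n_rows + 1) < max(min_samples, 2)
    slope[too_short] = np.nan
    intercept[too_short] = np.nan

    return pd.concat(
        {
            "intercepts": pd.DataFrame(intercept, index=df.index, columns=df.columns),
            "slopes": pd.DataFrame(slope, index=df.index, columns=df.columns),
        },
        axis=1,
    )
```

This does a constant amount of work per row instead of a full fit. The trade-off is numerical: cumulative sums of x*x and x*y can lose precision on very long series, so it is worth spot-checking a few windows against a direct fit.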
Toadeater answered 8/7, 2024 at 16:50

Comment (1):
It's an interesting idea; it certainly solves the double computation. Combining it with that of @Piere D may be the solution. (Toadeater)
