I have a dataframe holding multiple time series. I want to fit a line to each one and return the slope and intercept, and I want to do it in an expanding way: first fit on the data up to the first time point, then up to the second, ..., all the way to the last.
This is what the dataframe looks like:
>>> df
                  S1        S2        S3
Date
2019-01-02  0.019552  0.021958  0.001586
2019-01-03  0.000000  0.020325  0.000000
2019-01-04  0.007540  0.023239  0.001586
2019-01-07  0.007124  0.026128  0.002945
2019-01-08  0.010660  0.026203  0.002266
...              ...       ...       ...
2024-06-28  0.918257  0.467164  0.909070
2024-07-01  0.950133  0.497262  0.914810
2024-07-02  0.968436  0.551025  0.902500
2024-07-03  0.975092  0.589036  0.944868
2024-07-05  1.000000  0.601924  0.926365
As far as I understand, pandas' expanding().apply() can't return more than one scalar per window (ref, or I misunderstand how to use method='table'), so I do:
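import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

minimum_data_samples = 10
def fit_linear_regression(signal_column):
    # not enough samples yet: return NaN for both slope and intercept
    if signal_column.empty or len(signal_column) < minimum_data_samples:
        return np.nan, np.nan
    # use the sample position (0, 1, 2, ...) as the single regressor
    X = np.arange(len(signal_column)).reshape(-1, 1)
    reg = LinearRegression(n_jobs=-1).fit(X, signal_column)
    return reg.coef_[0], reg.intercept_  # coef_ is a length-1 array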
# fit once and return slopes
slopes = df.expanding().apply(lambda x: fit_linear_regression(x)[0])
# fit second time and return intercepts
intercepts = df.expanding().apply(lambda x: fit_linear_regression(x)[1])
# concatenate the two dataframes
slopes_and_intercepts = pd.concat([slopes, intercepts], keys=['slopes', 'intercepts'], axis=1)
The result looks like this (figuratively; the numbers do not correspond to the data above):
          intercepts                        slopes
                  S1        S2        S3        S1        S2        S3
Date
2014-01-02       NaN       NaN       NaN       NaN       NaN       NaN
2014-01-03       NaN       NaN       NaN       NaN       NaN       NaN
2014-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2014-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2014-01-08       NaN       NaN       NaN       NaN       NaN       NaN
...              ...       ...       ...       ...       ...       ...
2018-06-27  0.105235 -0.117916  0.063601  0.000141  0.001361  0.000162
2018-06-28  0.105115 -0.120374  0.063536  0.000142  0.001360  0.000162
2018-06-29  0.105001 -0.122802  0.063457  0.000142  0.001360  0.000162
2018-07-02  0.104905 -0.125193  0.063283  0.000143  0.001359  0.000162
2018-07-03  0.104788 -0.127579  0.062910  0.000143  0.001359  0.000163
This works, but fitting the same model twice, once to get the slopes and once to get the intercepts, feels like a waste of compute and time. To avoid the second fit I considered memoizing fit_linear_regression at the expense of memory (or disk), but it still feels like there should be a better way.
I wonder if there is a way to accomplish the above with a single call to fit_linear_regression.
The situation shares similarities with that in "Pandas' expanding with apply function on multiple columns", but it differs in three points:
- here, X is derived from the index
- each column in the input is a separate time series that we need to fit
- the output doesn't have to be added to the input dataframe and can be a new dataframe
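Because X is just the sample position, one direction that might avoid refitting altogether is the closed-form simple least-squares solution computed from cumulative sums. This is a different technique from LinearRegression, shown only as a rough sketch. It assumes the columns contain no NaNs, and the function name expanding_linear_fit and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def expanding_linear_fit(df, min_samples=10):
    """Expanding slope/intercept per column via closed-form OLS on x = 0, 1, 2, ..."""
    y = df.to_numpy(dtype=float)
    n = np.arange(1, len(df) + 1, dtype=float)[:, None]  # window size at each row
    x = n - 1.0                                          # same as np.arange(len(window))

    # Running sums over the expanding window
    sum_x = np.cumsum(x, axis=0)
    sum_xx = np.cumsum(x * x, axis=0)
    sum_y = np.cumsum(y, axis=0)
    sum_xy = np.cumsum(x * y, axis=0)

    denom = n * sum_xx - sum_x ** 2  # zero for the very first row
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n

    # Mask windows that are shorter than the minimum sample count
    too_short = (n < min_samples).ravel()
    slope[too_short] = np.nan
    intercept[too_short] = np.nan

    slopes = pd.DataFrame(slope, index=df.index, columns=df.columns)
    intercepts = pd.DataFrame(intercept, index=df.index, columns=df.columns)
    return pd.concat([slopes, intercepts], keys=["slopes", "intercepts"], axis=1)


# Hypothetical demo data
demo = pd.DataFrame({
    "S1": 2.0 * np.arange(15) + 1.0,
    "S2": np.sin(np.arange(15)),
})
result = expanding_linear_fit(demo)
```

This yields both slopes and intercepts in one pass without calling a fitting routine per window, but it only covers the plain straight-line case and would not generalize to an arbitrary model.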