I am performing component wise regression on a time series data. This is basically where instead of regressing y against x1, x2, ..., xN, we would regress y against x1 only, y against x2 only, ..., and take the regression that reduces the sum of square residues the most and add it as a base learner. This is repeated M times such that the final model is the sum of many many simple linear regression of the form y against xi (1 exogenous variable only), basically gradient boosting using linear regression as the base learners.
The problem is that since I am performing a rolling window regression on the time series data, I have to do N × M × T regressions which is more than a million OLS. Though each OLS is very fast, it takes a few hours to run on my weak laptop.
Currently, I am using statsmodels.OLS.fit()
as the way to get my parameters for each y against xi linear regression as such. The z_matrix
is the data matrix and the i
represents the ith column to slice for the regression. The number of rows is about 100 and z_matrix
is about size 100 × 500.
ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]
I have read from a previous post in 2016 Fastest way to calculate many regressions in python? that using repeated calls to statsmodels is not efficient and I tried one of the answers which suggested numpy's pinv
which is unfortunately slower:
# slower: 40sec vs 30sec for statsmodel for 100 repeated runs of 150 linear regressions
params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
y_hat = self.z_matrix[:, [i]]@params
ssr = sum((y_hat-endog)**2)
return params, ssr, y_hat
Does anyone have any better suggestions to speed up the computation of the linear regression? I just need the estimated parameters, sum of square residues, and predicted ŷ value. Thank you!