Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable only

I am performing component-wise regression on time series data. This is basically where, instead of regressing y against x1, x2, ..., xN jointly, we regress y against x1 only, y against x2 only, ..., take the regression that reduces the sum of squared residuals the most, and add it as a base learner. This is repeated M times, so the final model is the sum of many simple linear regressions of the form y against xi (one exogenous variable only); basically gradient boosting with simple linear regressions as the base learners.
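
For reference, here is a minimal sketch of that component-wise boosting loop (illustrative only: the function name componentwise_boost, the learning_rate, and the fixed number of rounds are my own assumptions; the base learner is a simple no-constant regression, as in the code further down):

    import numpy as np

    def componentwise_boost(X, y, n_rounds=100, learning_rate=0.1):
        """Greedy component-wise L2 boosting with single-variable, no-constant OLS base learners."""
        n, p = X.shape
        coef = np.zeros(p)
        resid = y.astype(float)
        for _ in range(n_rounds):
            # No-constant OLS slope of the residual on each column: (x'r) / (x'x)
            slopes = (X * resid[:, None]).sum(axis=0) / (X * X).sum(axis=0)
            # Sum of squared residuals of each candidate simple regression
            ssr = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
            best = np.argmin(ssr)                       # base learner that reduces SSR the most
            coef[best] += learning_rate * slopes[best]  # add it to the ensemble
            resid -= learning_rate * slopes[best] * X[:, best]
        return coef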

The problem is that, since I am performing a rolling-window regression on the time series, I have to run N × M × T regressions, which is more than a million OLS fits. Although each individual fit is very fast, the whole procedure takes a few hours on my weak laptop.

Currently, I am using statsmodels' OLS(...).fit() to get the parameters for each y-against-xi regression, as shown below. Here z_matrix is the data matrix and i is the index of the column sliced out for the regression; z_matrix has about 100 rows and 500 columns (roughly 100 × 500).

    # endog is y; exog is the i-th column of z_matrix, reshaped into an (n, 1) column
    ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
    return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]

I have read in a 2016 post, Fastest way to calculate many regressions in python?, that repeated calls to statsmodels are not efficient, so I tried one of the answers, which suggested numpy's pinv. Unfortunately it is slower:

    # slower: 40 s vs 30 s for statsmodels over 100 repeated runs of 150 linear regressions
    params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
    y_hat = self.z_matrix[:, [i]] @ params
    ssr = np.sum((y_hat - endog) ** 2)
    return params, ssr, y_hat

Does anyone have a better suggestion to speed up the computation of these regressions? I just need the estimated parameters, the sum of squared residuals, and the predicted ŷ values. Thank you!

Concerted answered 26/6, 2020 at 9:43 Comment(4)
The reference question is for the case of a common exog X. Here, X varies and y is the same. Most likely the fastest approach is to base it on a vectorized correlation coefficient. scipy.stats.linregress just uses plain covariance to compute the regression coefficient. I guess that could be vectorized. – Elasticity
I think the ranking for the "regression that reduces the sum of squared residuals the most" can, in this case, also be obtained directly from the simple correlation, without computing the regression results. – Elasticity
Hi Josef, thanks for your reply. How would you suggest obtaining the predictor that reduces the SSR the most without computing the regression? Is it simply the predictor Xi that is most correlated with y? – Concerted
@LimKaizhuo Did you ever solve this? How did you end up implementing it? Curious to know: I am doing something similar and facing the same issue of running 100k regressions for 100 items, and it's not scalable using the standard numpy regression and getting the R². – Stratocracy
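
Following up on the comments above, here is a small sketch of the ranking idea (my own illustration, assuming the no-constant setup from the question): for y ≈ b·x_i the minimized SSR is y'y − (x_i'y)²/(x_i'x_i), so the column that reduces SSR the most is the one that maximizes (x_i'y)²/(x_i'x_i); for demeaned data this is exactly the column most correlated with y.

    import numpy as np

    def best_predictor(X, y):
        """Index of the column whose simple no-constant regression on y has the lowest SSR."""
        # SSR_i = y'y - (x_i'y)^2 / (x_i'x_i); y'y is common to all columns,
        # so minimizing SSR is the same as maximizing the second term.
        score = (X.T @ y) ** 2 / (X * X).sum(axis=0)
        return np.argmax(score)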

Here is one way, taking advantage of the fact that you are always running regressions without a constant. This code runs around 900K models in about 0.5 s, and it retains the SSE, the predicted values, and the estimated parameters for each of the 900K regressions.

The big idea is to exploit the math behind a regression of one variable on another when the model does not contain a constant: the slope is just the ratio of a cross-product to an inner product, β̂ = (x'y) / (x'x). This could be modified to also include a constant by using a moving-window demean to estimate the intercept.
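
As a quick illustration of that identity (a toy check on a single regression, separate from the timing exercise below):

    import numpy as np
    from statsmodels.regression.linear_model import OLS

    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(100)
    y1 = rng.standard_normal(100)

    beta_ratio = (x1 @ y1) / (x1 @ x1)                # cross-product / inner product
    beta_ols = OLS(y1, x1[:, None]).fit().params[0]   # no-constant OLS slope
    assert np.allclose(beta_ratio, beta_ols)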

import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime

gen = np.random.default_rng(20210514)

# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))

now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed

# Build a selector array to quickly reshape y and the columns of x
step = np.arange(n - w + 1)  # one start index per rolling window of length w
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows, each holding one window
# of w observations, and looks like
# [[y[0], y[1], ..., y[99]],
#  [y[1], y[2], ..., y[100]],
#  ...,
#  [y[900], y[901], ..., y[999]]]
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((n - w + 1, m))
# Easiest to loop over the columns of x.
# Could do broadcasting tricks, but not worth the trouble since the number of columns is modest
for i in range(x.shape[1]):
    # Reshape a single column of x the same way as y
    x_block = x[locs, i]
    # Get the parameters and make sure it is 2d with shape (m-w+1, 1)
    # so the broadcasting works
    p = params[:, i][:, None]
    # Get the predicted value
    y_pred[i] = x_block * p
    # And the sse
    sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Verify one window and one column against statsmodels OLS
start = 124
assert start <= n - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + w], x[start : start + w, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])
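
For the with-constant case mentioned above, a rough sketch of how the moving-window demean could look (illustrative only; rolling_sum is just a helper name, and this reuses x, y, n, m, and w from the code above):

    def rolling_sum(a, w):
        """Rolling window sums via differences of cumulative sums (same trick as above)."""
        c = np.concatenate([np.zeros((1,) + a.shape[1:]), np.cumsum(a, axis=0)])
        return c[w:] - c[:-w]

    sum_x, sum_y = rolling_sum(x, w), rolling_sum(y, w)          # (n - w + 1, m) and (n - w + 1, 1)
    sum_xy, sum_xx = rolling_sum(x * y, w), rolling_sum(x * x, w)
    mean_x, mean_y = sum_x / w, sum_y / w
    # Slope = rolling covariance / rolling variance; intercept from the window means
    slope = (sum_xy - w * mean_x * mean_y) / (sum_xx - w * mean_x ** 2)
    intercept = mean_y - slope * mean_x
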
Penguin answered 14/5, 2021 at 14:22 Comment(0)
