Deprecated rolling window option in OLS from Pandas to Statsmodels

Asked 19/5, 2016 at 8:22 Answered 6/12, 2020 at 3:41

Solved python pandas deprecated statsmodels

as the title suggests, where has the rolling function option in the ols command in Pandas migrated to in statsmodels? I can't seem to find it. Pandas tells me doom is in the works:

FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
  model = pd.ols(y=series_1, x=mmmm, window=50)

in fact, if you do something like:

import statsmodels.api as sm

model = sm.OLS(series_1, mmmm, window=50).fit()

print(model.summary())

you get results (window does not impair the running of the code) but you get only the parameters of the regression run on the entire period, not the series of parameters for each of the rolling period it should be supposed to work on.

Talented answered 19/5, 2016 at 8:22 Comment(0)

I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

OLS : static (single-window) ordinary least-squares regression. The output are NumPy arrays
RollingOLS : rolling (multi-window) ordinary least-squares regression. The output are higher-dimension NumPy arrays.
PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPi) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS takes advantage of broadcasting extensively also. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.

An example:

import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd", 
    "T10Y2YM" : "term_spread", 
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64

Forced answered 11/6, 2017 at 17:32 Comment(5)

@CharlesPlager thanks for bringing that to my attention, link is updated. – Forced 1/11, 2017 at 14:33

Your sample code does not work running on Python 3.6.1. See repl.it/@SamArthur/InfatuatedRubberyResources – Morrell 15/2, 2019 at 20:3

@SamArthurGillam I've updated to a working example. The exception was raised because pandas-datareader was not correctly pulling in monthly frequency. (Looks like an fq param was added to the St. Louis Fed URLs for downloading.) That in turn meant that one of the columns was a bunch of zeros, which will cause a singular matrix and blow things up. – Forced 8/3, 2019 at 16:0

Hi brad, I find you example data kind of irritating. Just to make it clear: Your rebuilt uses term spread and gold changes as explanatory variables for changes in the trade weighted usd value? And the term spread change, or first column of x is therefore the risk free return and the gold change the market return? Why don't you use more simple variables like the change of a stock for y, and tbill rates(not changes) as risk free return and sp500 changes as market return? – Merwin 16/8, 2020 at 6:3

@LucaReichelt it is admittedly a contrived example, but calling an answer that someone has provided for you free of charge "irritating" is a bit much. You are welcome to suggest edits directly to the answer – Forced 17/8, 2020 at 18:35

Rolling beta with sklearn

import pandas as pd
from sklearn import linear_model

def rolling_beta(X, y, idx, window=255):

    assert len(X)==len(y)

    out_dates = []
    out_beta = []

    model_ols = linear_model.LinearRegression()

    for iStart in range(0, len(X)-window):        
        iEnd = iStart+window

        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])

        #store output
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])

    return pd.DataFrame({'beta':out_beta}, index=out_dates)


df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)

Toolmaker answered 20/11, 2016 at 16:2 Comment(0)

Adding for completeness a speedier numpy-only solution which limits calculations only to the regression coefficients and the final estimate

Numpy rolling regression function

import numpy as np

def rolling_regression(y, x, window=60):
    """ 
    y and x must be pandas.Series
    """
# === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
# === Trim acc to shortest ================================================
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
# === Verify enough space =================================================
    if x.index.size < window:
        return None
    else:
    # === Add a constant if needed ========================================
        X = x.to_frame()
        X['c'] = 1
    # === Loop... this can be improved ====================================
        estimate_data = []
        for i in range(window, x.index.size+1):
            X_slice = X.values[i-window:i,:] # always index in np as opposed to pandas, much faster
            y_slice = y.values[i-window:i]
            coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
            estimate_data.append(coeff[0] * x.values[window-1] + coeff[1])
    # === Assemble ========================================================
        estimate = pandas.Series(data=estimate_data, index=x.index[window-1:]) 
        return estimate

Notes

In some specific case uses, which only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow

As a reminder, the coefficients for a regression can be calculated as a matrix product, as you can read on wikipedia's least squares page. This approach via numpy's matrix multiplication can speed up the process somewhat vs using the ols in statsmodels. This product is expressed in the line starting as coeff = ...

Shadbush answered 30/4, 2017 at 17:47 Comment(0)

For rolling trend in one column, one can just use:

import numpy as np
def calc_trend(window:int = 30):
    df['trend'] = df.rolling(window = window)['column_name'].apply(lambda x: np.polyfit(np.array(range(0,window)), x, 1)[0], raw=True)

However, in my case I wasted to find a trend with respect to date, where date was in another column. I had to create the functionality manually, but it is easy. First, convert from TimeDate to int64 representing days from t_0:

xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)

Then:

def calc_trend(window:int=30):
    for t in range(len(df)):
        if t < window//2:
            continue
        i0 = t  - window//2 # Start window
        i1 = i0 + window    # End window
        xvec = xdays[i0:i1]
        yvec = df['column_name'][i0:i1].values
        df.loc[t,('trend')] = np.polyfit(xvec, yvec, 1)[0]

Bristle answered 6/12, 2020 at 3:41 Comment(0)

Recommended topics

Hot tags