Pandas rolling regression: alternatives to looping

I got good use out of pandas' MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was gutted completely with pandas 0.20.

The question of how to run rolling OLS regression in an efficient manner has been asked several times (here, for instance), but phrased a little broadly and left without a great answer, in my view.

Here are my questions:

  1. How can I best mimic the basic framework of pandas' MovingOLS? The most attractive feature of this class was the ability to view multiple methods/attributes as separate time series--i.e. coefficients, r-squared, t-statistics, etc.--without needing to re-run the regression. For example, you could create something like model = pd.MovingOLS(y, x) and then call .t_stat, .rmse, .std_err, and the like. In the example below, by contrast, I don't see a way around computing each statistic separately. Is there a method that doesn't involve creating sliding/rolling "blocks" (strides) and running regressions/using linear algebra to get model parameters for each window?

  2. More broadly, what's going on under the hood in pandas that makes rolling.apply unable to take more complex functions?* When you create a .rolling object, in layman's terms, what's going on internally--is it fundamentally different from looping over each window and creating a higher-dimensional array as I'm doing below?

*Namely, func passed to .apply:

  • Must produce a single value from an ndarray input
  • *args and **kwargs are passed to the function
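
To make that constraint concrete, here is a minimal sketch (the exact exception text varies by pandas version): a function that reduces each window to a scalar is fine, while one that returns an array is not.

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100))

s.rolling(10).apply(np.mean, raw=True)   # works: one scalar per window

# s.rolling(10).apply(lambda x: np.polyfit(np.arange(10), x, 1), raw=True)
# raises an error, because polyfit returns an array rather than a single value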

Here's where I'm currently at with some sample data, regressing percentage changes in the trade weighted dollar on interest rate spreads and the price of copper. (This doesn't make a ton of sense; just picked these randomly.) I've taken it out of a class-based implementation and tried to strip it down to a simpler script.

from datetime import date

import numpy as np
import pandas as pd
from pandas_datareader.data import DataReader
import statsmodels.api as sm

syms = {'TWEXBMTH' : 'usd', 
        'T10Y2YM' : 'term_spread', 
        'PCOPPUSDM' : 'copper'
       }

start = date(2000, 1, 1)
data = (DataReader(syms.keys(), 'fred', start)
        .pct_change()
        .dropna())
data = data.rename(columns = syms)
data = data.assign(intercept = 1.) # required by statsmodels OLS

def sliding_windows(x, window):
    """Create rolling/sliding windows of length ~window~.

    Given an array of shape (y, z), it will return "blocks" of shape
    (x - window + 1, window, z)."""

    return np.array([x[i:i + window] for i 
                    in range(0, x.shape[0] - window + 1)])
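
(As an aside, NumPy 1.20+ provides numpy.lib.stride_tricks.sliding_window_view, which produces the same blocks as a zero-copy view; a minimal equivalent sketch:)

from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def sliding_windows_view(x, window):
    """Same (n - window + 1, window, z) blocks, without the Python loop."""
    return sliding_window_view(x, window, axis=0).transpose(0, 2, 1)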

data.head(3)
Out[33]: 
                 usd  term_spread    copper  intercept
DATE                                                  
2000-02-01  0.012573    -1.409091 -0.019972        1.0
2000-03-01 -0.000079     2.000000 -0.037202        1.0
2000-04-01  0.005642     0.518519 -0.033275        1.0

window = 36
wins = sliding_windows(data.values, window=window)
y, x = wins[:, :, 0], wins[:, :, 1:]

coefs = []

for endog, exog in zip(y, x):
    model = sm.OLS(endog, exog).fit()
    # The full set of model attributes gets lost with each loop
    coefs.append(model.params)

df = pd.DataFrame(coefs, columns=data.iloc[:, 1:].columns,
                  index=data.index[window - 1:])

df.head(3) # rolling 36m coefficients
Out[70]: 
            term_spread    copper  intercept
DATE                                        
2003-01-01    -0.000122 -0.018426   0.001937
2003-02-01     0.000391 -0.015740   0.001597
2003-03-01     0.000655 -0.016811   0.001546
Toilette answered 6/6, 2017 at 1:31 Comment(4)
As a basic workaround I would use your loop and store all attributes and results that you need in one pass. Here is my not-an-answer, because I don't know pandas well enough: github.com/statsmodels/statsmodels/issues/…Manx
Here are my thoughts. I am a hardcore pandas and R user: for anything regression-related, use R, not Python. Granted, you can do approximately the same thing, but Python cannot match the breadth of R's packages. Plus, you have packages like broom that let you very easily capture the regression statistics you care about and put them into either a dataframe or LaTeX tables.Shaffer
Does MovingOLS not work, or was it just removed? Could you just take the MovingOLS code and effectively make your own little library out of it?Catharine
It's been removed completely. And I'd rather not simply copy 1400 lines of code that have more than a handful of dependencies.Toilette

I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

  • OLS : static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
  • RollingOLS : rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.
  • PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm in the process of uploading to PyPI) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
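
For intuition, here is a rough sketch of that matrix-algebra idea (not the package's actual code): stack the trailing windows into a 3-D array and solve the normal equations for every window in one vectorized pass.

import numpy as np

def rolling_betas(x, y, window):
    """Coefficients for every trailing window of length `window`.

    x is an (n, k) regressor matrix (include an intercept column),
    y is the (n,) response; returns an array of shape (n - window + 1, k).
    """
    xw = np.stack([x[i:i + window] for i in range(len(x) - window + 1)])
    yw = np.stack([y[i:i + window] for i in range(len(y) - window + 1)])
    xtx = np.einsum('nij,nik->njk', xw, xw)   # X'X for each window
    xty = np.einsum('nij,ni->nj', xw, yw)     # X'y for each window
    # Solve (X'X) b = X'y across all windows at once
    return np.linalg.solve(xtx, xty[..., None]).squeeze(-1)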

An example:

import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd", 
    "T10Y2YM" : "term_spread", 
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64
Toilette answered 10/6, 2017 at 22:33 Comment(2)
The main problem with this approach is that keeping all OLS instances around requires a lot of memory.Manx
@user333700 I've made some major changes that should be much kinder to memory if you're interested in taking a look.Toilette

Use a custom rolling apply function.

import numpy as np

window = 125
# 'value' stands in for whichever column you are fitting against the time index
df['slope'] = df['value'].rolling(window).apply(
    lambda x: np.polyfit(np.arange(window), x, 1)[0], raw=True)
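
For instance, with made-up data (the column name value and the trend itself are purely illustrative):

import numpy as np
import pandas as pd

# Hypothetical data: a noisy upward drift
rng = np.random.default_rng(0)
df = pd.DataFrame({'value': np.cumsum(rng.normal(0.1, 1.0, 500))})

window = 125
df['slope'] = df['value'].rolling(window).apply(
    lambda x: np.polyfit(np.arange(window), x, 1)[0], raw=True)

print(df['slope'].dropna().head())   # one slope per trailing 125-row window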
Reiner answered 18/4, 2020 at 8:4 Comment(0)
