How to do OLS Regression with the latest version of Pandas

I want to run a rolling OLS regression, with a window of 1000 observations, on the dataset found at the following URL:

https://drive.google.com/open?id=0B2Iv8dfU4fTUa3dPYW5tejA0bzg

I tried using the following Python script with pandas version 0.20.2.

#!/usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv('estimated.csv', names=('x','y'))

model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['y']], 
                               window_type='rolling', window=1000, intercept=True)
df['Y_hat'] = model.y_predict

However, when I run my Python script, I get this error: AttributeError: module 'pandas.stats' has no attribute 'ols'. I found out that the error occurs because pandas.stats.ols was removed in Pandas version 0.20.0, as we can see from the following link.

https://github.com/pandas-dev/pandas/pull/11898

How can we do OLS Regression with the latest version of Pandas?

Viminal answered 22/6, 2017 at 21:38

While I would normally suggest applying something like the statsmodels OLS class on a rolling basis*, your dataset is large (length-1000 windows on 258k rows) and you will likely run into a memory error that way. Instead, you could use the linear algebra approach: calculate the coefficients for each window and then apply them to that window of your explanatory variable. For more on this, see A Matrix Formulation of the Multiple Regression Model.

* To see an implementation that uses statsmodels, see a wrapper I created here. An example is here.
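
As a minimal sketch of the linear algebra approach mentioned above (not the code from this answer), the coefficients of a single window can be obtained from the normal equations and checked against NumPy's least-squares solver. The data is synthetic, and the names `feature`, `target` and `beta_normal` are made up for illustration; an intercept column is added explicitly here, which is an assumption about the model you want.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 200

# Synthetic data: target = 1.5 + 0.8 * feature + small noise (illustrative only)
feature = rng.normal(size=n_obs)
target = 1.5 + 0.8 * feature + rng.normal(scale=0.05, size=n_obs)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n_obs), feature])

# Normal equations: (X'X) beta = X'y, solved without forming the inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ target)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, target, rcond=None)

# Fitted values for this window
fitted = X @ beta_normal
print(beta_normal)
```

Using `np.linalg.solve` instead of an explicit inverse is a design choice for numerical stability; both give the same coefficients on well-conditioned data.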

Note that yhat here is not a single nx1 vector; it is a set of vectors stacked on top of each other, i.e. you get one set of predictions per rolling 1000-period block. So the shape of your predictions will be (257526, 1000), as shown below.
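
The number of blocks follows from n - window + 1. A small sketch of that arithmetic, using `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later, and not what the code in this answer uses) on a toy array; the sizes 12 and 5 are arbitrary:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_rows, window = 12, 5
values = np.arange(n_rows, dtype=float)

# One row per rolling block; each row holds `window` consecutive values
blocks = sliding_window_view(values, window)

print(blocks.shape)  # (8, 5), i.e. (n_rows - window + 1, window)
```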

import numpy as np
import pandas as pd

df = pd.read_csv('input/estimated.csv', names=('x','y'))

def rolling_windows(a, window):
    """Creates rolling-window 'blocks' of length `window` from `a`.

    Note that the orientation of rows/columns follows that of pandas.

    Example
    =======
    onedim = np.arange(20)
    twodim = onedim.reshape((5,4))

    print(twodim)
    [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]
     [12 13 14 15]
     [16 17 18 19]]

    print(rolling_windows(onedim, 3)[:5])
    [[0 1 2]
     [1 2 3]
     [2 3 4]
     [3 4 5]
     [4 5 6]]

    print(rolling_windows(twodim, 3)[:5])
    [[[ 0  1  2  3]
      [ 4  5  6  7]
      [ 8  9 10 11]]

     [[ 4  5  6  7]
      [ 8  9 10 11]
      [12 13 14 15]]

     [[ 8  9 10 11]
      [12 13 14 15]
      [16 17 18 19]]]
    """

    if isinstance(a, (pd.Series, pd.DataFrame)):
        a = a.values
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    shape = (a.shape[0] - window + 1, window) + a.shape[1:]
    strides = (a.strides[0],) + a.strides
    windows = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return np.squeeze(windows)

def coefs(y, x):
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))

rendog = rolling_windows(df.x.values, 1000)
rexog = rolling_windows(df.drop('x', axis=1).values, 1000)

preds = list()
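for endog, exog in zip(rendog, rexog):
    # np.squeeze drops the column axis when there is a single regressor,
    # so restore a 2-D (window, n_regressors) shape before the matrix algebra.
    exog = exog.reshape(len(exog), -1)
    pred = np.sum(coefs(endog, exog).T * exog, axis=1)
    preds.append(pred)
preds = np.array(preds)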

print(preds.shape)
(257526, 1000)
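
If you want a single Y_hat column rather than the full stack of blocks, one option is to keep only the last fitted value of each window and align it with the row that closes that window. The sketch below is an assumption about what you want, not part of the approach above: it uses small synthetic data, the hypothetical names `demo` and `preds_demo`, and an intercept column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, window = 50, 10

# Synthetic stand-in for the real data (illustrative only)
demo = pd.DataFrame({"feature": rng.normal(size=n)})
demo["target"] = 0.5 + 2.0 * demo["feature"] + rng.normal(scale=0.01, size=n)

# Stand-in for the stacked predictions: one row of fitted values per window
n_windows = n - window + 1
preds_demo = np.empty((n_windows, window))
for start in range(n_windows):
    block = demo.iloc[start:start + window]
    X = np.column_stack([np.ones(window), block["feature"].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, block["target"].to_numpy(), rcond=None)
    preds_demo[start] = X @ beta

# Keep the last fitted value of each window; the first window - 1 rows
# have no complete window behind them, so they stay NaN.
demo["Y_hat"] = np.concatenate([np.full(window - 1, np.nan), preds_demo[:, -1]])
print(demo.tail())
```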

Lastly: have you considered using a Random Forest Classifier here, given that your y variable is discrete?

Goosestep answered 23/6, 2017 at 12:37

You can simply import the OLS class from statsmodels, as shown below:

from statsmodels.regression.linear_model import OLS
Coreycorf answered 13/2, 2020 at 5:31
