Pandas/statsmodels OLS: predicting future values

I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:

import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred

The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.

from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict

(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about the many other problems people have had, but they do not seem to apply to my case.

Edit: I believe 'endog' as defined is incorrect. I should be passing the values for which I want to predict, so I've created a date range of 12 periods past the last recorded value. But I am still missing something, as I am getting the error:

matrices are not aligned

Edit: here is a snippet of the data. The last column of numbers is the date delta, which is the difference in months from the first date:

   month       monthly_data  monthly_data_smoothed5  monthly_data_smoothed8  monthly_data_smoothed12  monthly_data_smoothed3  date_delta
0  2011-01-31  3.711838e+11  3.711838e+11            3.711838e+11            3.711838e+11             3.711838e+11            0.000000
1  2011-02-28  3.776706e+11  3.750759e+11            3.748327e+11            3.746975e+11             3.755084e+11            0.919937
2  2011-03-31  4.547079e+11  4.127964e+11            4.083554e+11            4.059256e+11             4.207653e+11            1.938438
3  2011-04-30  4.688370e+11  4.360748e+11            4.295531e+11            4.257843e+11             4.464035e+11            2.924085
Figone answered 26/8, 2014 at 19:58 Comment(5)
Without your data, one can only speculate. Please post a self-contained example that includes code to generate your data. See sscce.org for more info. (Fasta)
OK... I tried to post a copied and pasted data frame from an IPython output, but of course it doesn't format correctly... (Figone)
Do df.to_dict and paste that. (Fasta)
Your first set of code looks ok. Why would you expect the values to be the same? Maybe look at smresults.summary() to see how well the model is fit. (Vetter)
It's not that I really expect them to be the same; it's just that it is returning values for 42 periods, but I don't know which ones, and the values are on the order of e+22, which is way too high. (Figone)

I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To fix that in your code, you could do something like this:

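import pandas as pd
import statsmodels.api as sm

dframe = pd.read_clipboard()  # your sample data
dframe['intercept'] = 1  # constant regressor, acts as the intercept
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']

smresults = sm.OLS(y, X).fit()

dframe['pred'] = smresults.predict()  # in-sample fitted values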

Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add - 1 to the formula to remove it). See below; it should give the same answer.

import statsmodels.formula.api as smf

smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()

dframe['pred'] = smresults.predict()
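With the formula API, predicting beyond the observed range only needs a DataFrame that has a date_delta column; the intercept is handled by the formula. The sketch below assumes the four sample rows from the question and a hypothetical horizon of 12 future periods, each one unit of date_delta past the previous one.

```python
import pandas as pd
import statsmodels.formula.api as smf

# The four sample rows from the question, retyped by hand.
dframe = pd.DataFrame({
    "date_delta": [0.000000, 0.919937, 1.938438, 2.924085],
    "monthly_data_smoothed8": [3.711838e11, 3.748327e11,
                               4.083554e11, 4.295531e11],
})

smresults = smf.ols("monthly_data_smoothed8 ~ date_delta", data=dframe).fit()

# Hypothetical future periods: 12 steps of one date_delta unit each
# past the last observation (an assumption made for illustration).
last = dframe["date_delta"].max()
future = pd.DataFrame({"date_delta": [last + step for step in range(1, 13)]})

# The formula builds the intercept column for the new rows itself.
future["pred"] = smresults.predict(future)
print(future)
```

Keep in mind that this is a straight-line extrapolation, so the further out the periods are, the more the prediction relies on the linear trend continuing.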

Edit:

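To predict future values, just pass new data to .predict(). The columns of the new data must be in the same order as the columns of X used for the fit (intercept first, then date_delta). Older pandas versions sort dict keys alphabetically when building a DataFrame, which silently swaps the two coefficients, so select the columns explicitly. For example, using the first model:

new_X = pd.DataFrame({'intercept': 1,
                      'date_delta': [0.5, 0.75, 1.0]})
smresults.predict(new_X[['intercept', 'date_delta']])

With the four sample rows above, the predictions are approximately 3.757e+11, 3.811e+11 and 3.864e+11.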

On the intercept: there's nothing encoded in the number 1. It's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:

X = sm.add_constant(X)
Vetter answered 26/8, 2014 at 21:27 Comment(4)
Hello, thanks! A couple of questions: the second code worked, but I'm wondering how to get a prediction for date_delta values beyond the range of my data. The dframe['intercept'] = 1 line is returning an error of "'intercept' not in index". Further, I am able to see the intercept precisely from the model summary; should I use that, or does '1' encode something? Thanks again! (Figone)
Sorry! I see what is happening with the intercept! I just need to figure out how to predict future periods with this model now... (Figone)
@Figone - on the error - make sure that type(dframe) is a pandas DataFrame; you shouldn't be getting that error. See the edit for answers to your other questions. (Vetter)
Very nice, thank you so much. I did a bunch of EWMAs with different spans and am predicting future values using EWMA, and I wanted to compare that to OLS predictions on the original data and on the smoothed data... no idea what to trust, but thanks for helping me with this piece! (Figone)
