I'd like to find the standard deviation and confidence intervals for an out-of-sample prediction from an OLS model.
This question is similar to Confidence intervals for model prediction, but with an explicit focus on using out-of-sample data.
The idea would be for a function along the lines of wls_prediction_std(lm, data_to_use_for_prediction=out_of_sample_df)
, that returns the prstd, iv_l, iv_u
for that out of sample dataframe.
For instance:
import pandas as pd
import random
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
df = pd.DataFrame({"y":[x for x in range(10)],
"x1":[(x*5 + random.random() * 2) for x in range(10)],
"x2":[(x*2.1 + random.random()) for x in range(10)]})
out_of_sample_df = pd.DataFrame({"x1":[(x*3 + random.random() * 2) for x in range(10)],
"x2":[(x + random.random()) for x in range(10)]})
formula_string = "y ~ x1 + x2"
lm = smf.ols(formula=formula_string, data=df).fit()
# Prediction works fine:
print(lm.predict(out_of_sample_df))
# I can also get std and CI for in-sample data:
prstd, iv_l, iv_u = wls_prediction_std(lm)
print(prstd)
# I cannot figure out how to get std and CI for out-of-sample data:
try:
print(wls_prediction_std(lm, exog= out_of_sample_df))
except ValueError as e:
print(str(e))
#returns "ValueError: wrong shape of exog"
# trying to concatenate the DFs:
df_both = pd.concat([df, out_of_sample_df],
ignore_index = True)
# Only returns results for the data from df, not from out_of_sample_df
lm2 = smf.ols(formula=formula_string, data=df_both).fit()
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm2)
print(prstd2)