Predicting values using an OLS model with statsmodels

I calculated a model using OLS (multiple linear regression). I split my data into train and test halves, and now I would like to predict values for the second half of the labels.

model = OLS(labels[:half], data[:half])
predictions = model.predict(data[half:])

The problem is that I get an error:

File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/regression/linear_model.py", line 281, in predict
    return np.dot(exog, params)
ValueError: matrices are not aligned

I have the following array shapes:

data.shape: (426, 215)
labels.shape: (426,)

If I transpose the input to model.predict, I do get a result, but with a shape of (426, 213), so I suppose it's wrong as well (I expect one vector of 213 numbers as the label predictions):

model.predict(data[half:].T)

Any idea how to get it to work?

Gumm answered 4/11, 2012 at 12:21 Comment(0)

For statsmodels >= 0.4, if I remember correctly:

model.predict does not know about the fitted parameters; it requires them as an argument in the call. See http://statsmodels.sourceforge.net/stable/generated/statsmodels.regression.linear_model.OLS.predict.html
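For example, a minimal sketch of passing the parameters yourself, reusing the model and data names from the question (the model-level predict takes the coefficients as its first argument):

results = model.fit()                                       # estimate the coefficients first
predictions = model.predict(results.params, data[half:])    # model-level predict needs params explicitly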

What should work in your case is to fit the model and then use the predict method of the results instance.

model = OLS(labels[:half], data[:half])
results = model.fit()
predictions = results.predict(data[half:])

or shorter

results = OLS(labels[:half], data[:half]).fit()
predictions = results.predict(data[half:])
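As a quick sanity check on shapes (the numbers below assume the 426-row data from the question):

predictions.shape   # (213,) -- one predicted label per row of data[half:]
results.params      # estimated coefficients, one per column of data, here shape (215,)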

http://statsmodels.sourceforge.net/stable/generated/statsmodels.regression.linear_model.RegressionResults.predict.html (the docstring is currently missing there)

Note: this has been changed in the development version (in a backwards-compatible way), which can take advantage of "formula" information in predict: http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.predict.html
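For example, with the formula interface the new data only needs the raw columns; a minimal sketch (the DataFrame and the column names y and x1 are made up for illustration):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y": [1.0, 2.0, 3.5, 4.0], "x1": [0.0, 1.0, 2.0, 3.0]})
res = smf.ols("y ~ x1", data=df).fit()

# predict() re-applies the formula, so new observations only need the x1 column
res.predict(pd.DataFrame({"x1": [2.5, 4.0]}))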

Guinea answered 4/11, 2012 at 13:23 Comment(3)
Although this is the correct answer to the question, a BIG WARNING about the model fitting and data splitting: you should use 80% of the data (or a bigger part) for training/fitting and 20% (the rest) for testing/predicting. Splitting the data 50:50 is like Schrödinger's cat: we have no confidence that our data are all good or all wrong, so confidence in the model is somewhere in the middle. We want better confidence in our model, so we should train on more data than we test on.Moldboard
The 70/30 or 80/20 splits are rules of thumb for small data sets (up to hundreds of thousands of examples). This should not be seen as THE rule for all cases. In deep learning, where you often work with billions of examples, you typically want to train on 99% of the data and test on 1%, which can still be tens of millions of records. A 50/50 split is generally a bad idea, though (a concrete 80/20 split is sketched after these comments).Excrete
As long as your results are statistically significant, they are acceptable. For example, in a case where your data is highly unbalanced, none of the above-mentioned methods results in a statistically significant comparison, because the data is highly biased and even simple averaging will result in frequently correct answers!Independent
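To make the split advice in the comments concrete, a minimal sketch of a shuffled 80/20 split for the original example (the 0.8 fraction and np.random.permutation are illustrative choices, not part of the answer):

cut = int(0.8 * len(labels))                  # 80% of the rows for training
idx = np.random.permutation(len(labels))      # shuffle before splitting
train, test = idx[:cut], idx[cut:]
results = OLS(labels[train], data[train]).fit()
predictions = results.predict(data[test])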

You can also call the get_prediction method of the results object to get the prediction together with its standard error and confidence intervals. Example:

import numpy as np
import statsmodels.api as sm

X = np.array([0, 1, 2, 3])
y = np.array([1, 2, 3.5, 4])
X = sm.add_constant(X)   # prepend a column of ones for the intercept
model = sm.OLS(y, X)
results = model.fit()

predict:

# Predict at x=2.5
X_test = np.array([1, 2.5])  # "1" refers to the intercept term
results.get_prediction(X_test).summary_frame(alpha=0.05)  # alpha = significance level for confidence interval

gives:

       mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0     3.675  0.198431       2.821219       4.528781      2.142416      5.207584

where the mean_ci columns give the confidence interval for the mean and the obs_ci columns give the prediction interval for a new observation.
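Since summary_frame returns a pandas DataFrame, the individual intervals can be pulled out by column; a small sketch continuing the example above:

frame = results.get_prediction(X_test).summary_frame(alpha=0.05)
ci = frame[["mean_ci_lower", "mean_ci_upper"]]   # confidence interval for the mean
pi = frame[["obs_ci_lower", "obs_ci_upper"]]     # prediction interval for a new observation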

Ampereturn answered 9/1, 2022 at 15:57 Comment(0)
