How to predict new values using statsmodels.formula.api (python)
S

4

8

I trained a logistic model on the breast cancer data using only one feature, 'mean_area':

from statsmodels.formula.api import logit
logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

The fitted model has a built-in predict method. However, called without arguments, it returns the predicted values for all the training samples:

predictions = result.predict()

Suppose I want the prediction for a new value, say 30. How do I use the trained model to output the value (rather than reading the coefficients and computing it manually)?

Subshrub answered 15/8, 2016 at 14:35 Comment(0)
S
6

You can pass new values to the .predict() method, as illustrated for a single observation in output #11 of this notebook from the docs. Multiple observations can be passed as a 2-D array, for instance a DataFrame (see the docs).

Since you are using the formula API, your input needs to be in the form of a pd.DataFrame so that the column references are available. In your case, you could use something like .predict(pd.DataFrame({'mean_area': [1,2,3]})).

By default, when no alternative data is provided, statsmodels' .predict() uses only the observations that were used for fitting.
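As a minimal sketch of the call described above: the question's breast cancer data is not available here, so the `breast` DataFrame below is a synthetic stand-in with made-up values, and the names `new_data`, `single` and `several` are only illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the breast cancer data (assumption: values are made up).
rng = np.random.default_rng(0)
mean_area = rng.uniform(100, 2000, size=200)
# Larger areas get a lower probability of target == 1; the random draw
# keeps the classes from being perfectly separable.
prob = 1 / (1 + np.exp((mean_area - 700) / 150))
breast = pd.DataFrame({"mean_area": mean_area,
                       "target": rng.binomial(1, prob)})

result = logit("target ~ mean_area", breast).fit(disp=0)

# The new data must expose the same column name used in the formula.
new_data = pd.DataFrame({"mean_area": [30]})
single = result.predict(new_data)

# Several new observations at once: one row per observation.
several = result.predict(pd.DataFrame({"mean_area": [30, 500, 1500]}))
print(single)
print(several)
```

Note that the call is made on the fitted results object (`result`), and that the returned values are predicted probabilities, not class labels.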

Sweeten answered 15/8, 2016 at 15:49 Comment(6)
Thanks for the answer. I had a look at the notebook; however, in my case, when I try .predict(30) it throws the error "'int' object has no attribute '__getitem__'". - Subshrub
You are getting this error because the exog parameter has to be array-like, so you'd have to use [30]. Arrays have a __getitem__ method because, in contrast to an int, they can contain multiple items. - Sweeten
Thanks. When I try .predict([30]) I get the following error: "TypeError: list indices must be integers, not str". - Subshrub
Sorry, because of the formula API the input has to be a DataFrame; see the updated answer. - Sweeten
Note that you can simply pass a dictionary into any of the statsmodels APIs that accept DataFrames; there is no need to create a DataFrame unnecessarily. The following example notebook shows this in the last step: statsmodels.org/dev/examples/notebooks/generated/predict.html - Stearns
The links listed are now dead. - Despotic
R
1
import statsmodels.formula.api as smf


model = smf.ols('y ~ x', data=df).fit()

# Predict for a list of observations; the list can hold one or many values
prediction = model.get_prediction(exog=dict(x=[5,10,25])) 
prediction.summary_frame(alpha=0.05)
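The snippet above assumes an existing `df` with columns `x` and `y`. A self-contained sketch, using a made-up `df` (an assumption, not data from the thread), shows what `summary_frame` returns for an OLS fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: y is roughly 2 + 3 * x plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 30, 50)
df = pd.DataFrame({"x": x, "y": 2.0 + 3.0 * x + rng.normal(0, 1, size=50)})

model = smf.ols("y ~ x", data=df).fit()

# A plain dict keyed by the formula's variable name is enough here.
prediction = model.get_prediction(exog=dict(x=[5, 10, 25]))
frame = prediction.summary_frame(alpha=0.05)

# mean_ci_* bound the fitted mean; obs_ci_* bound a new observation.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```

With `alpha=0.05` the intervals are 95% intervals. The `obs_ci_*` columns are wider than the `mean_ci_*` columns because they also account for the residual noise of a single new observation.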
Ruche answered 15/4, 2019 at 18:30 Comment(0)
O
0

I had difficulty predicting values from a fresh pandas DataFrame, so I appended the rows to be predicted to the original dataset after fitting:

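   import pandas as pd
   import statsmodels.api as sm

   y = data['price']
   x1 = data[['size', 'year']]
   data.columns
   #Index(['price', 'size', 'year'], dtype='object')
   x = sm.add_constant(x1)
   results = sm.OLS(y, x).fit()
   results.summary()
   ## predict on unknown data
   # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
   new_rows = pd.DataFrame({'size': [853.0, 777], 'year': [2012.0, 2013], 'price': [None, None]})
   data = pd.concat([data, new_rows])
   data.tail()
   new_x = data.loc[data.price.isnull(), ['size', 'year']]
   # has_constant='add' adds the constant column even if a predictor is constant in new_x
   results.predict(sm.add_constant(new_x, has_constant='add'))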
Oleson answered 9/5, 2021 at 13:12 Comment(0)
F
0

This is already answered but I hope this will help.

According to the documentation, the first parameter is "exog".

exog : array_like, optional The values for which you want to predict

Further it says,

"If a formula was used, then exog is processed in the same way as the original data. This transformation needs to have key access to the same variable names, and can be a pandas DataFrame or a dict like object that contains numpy arrays.

If no formula was used, then the provided exog needs to have the same number of columns as the original exog in the model. No transformation of the data is performed except converting it to a numpy array.

Row indices as in pandas data frames are supported, and added to the returned prediction"

from statsmodels.formula.api import logit

logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

Therefore, you can pass a pandas DataFrame (e.g. df) as the exog parameter. The DataFrame must contain mean_area as a column, because 'mean_area' is the predictor (the independent variable). Note that predict has to be called on the fitted result, not on the unfitted model:

predictions = result.predict(exog=df)
Furmenty answered 1/7, 2021 at 20:56 Comment(0)
