How to fit a model to my testing set in statsmodels (python)

I am working on a logistic regression model and I am having trouble understanding how to apply the model fitted on my training set to my testing set. Sorry, I am new to Python and VERY new to statsmodels.

import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation

independent_vars = phy_train.columns[3:]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(phy_train[independent_vars], phy_train['target'], test_size=0.3, random_state=0)
X_train = pd.DataFrame(X_train)
X_train.columns = independent_vars
X_test = pd.DataFrame(X_test)
X_test.columns = independent_vars
y_train = pd.DataFrame(y_train)
y_train.columns = ['target']
y_test = pd.DataFrame(y_test)
y_test.columns = ['target']
logit = sm.Logit(y_train,X_train[subset],missing='drop')
result = logit.fit()
print result.summary()

y_pred = logit.predict(X_test[subset])

From the last line, I get this error:

y_pred = logit.predict(X_test[subset])
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\statsmodels\discrete\discrete_model.py", line 378, in predict
    return self.cdf(np.dot(exog, params))
ValueError: matrices are not aligned

My training and testing data sets have the same number of variables, so I am sure I am misunderstanding what logit.predict() is actually doing.

Tobin answered 13/4, 2014 at 21:32 Comment(2)
Do np.asarray(X_train[subset]).shape and np.asarray(X_test[subset]).shape have the same second value? - Pratt
@user333700 Yes, they do. - Tobin

There are two predict methods.

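logit in your example is the model instance, which doesn't know about the estimation results. The model's predict method therefore has a different signature, because it also needs the parameters: logit.predict(params, exog). In your call, X_test[subset] was taken as params and exog defaulted to the training data, so the dot product shown in the traceback failed on mismatched shapes. The model-level predict is mainly of interest for internal use.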
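To illustrate why the model-level call fails, here is a minimal NumPy-only sketch that mimics the line from the traceback, cdf(np.dot(exog, params)). It is not the statsmodels implementation; the helper names logistic_cdf and model_predict, the array shapes, and the coefficient values are all made up for illustration.

```python
import numpy as np


def logistic_cdf(z):
    """Logistic function, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def model_predict(params, exog):
    """Hypothetical stand-in for a model-level predict(params, exog)."""
    # Same shape of computation as the line shown in the traceback.
    return logistic_cdf(np.dot(exog, params))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(70, 3))   # made-up training design matrix
X_test = rng.normal(size=(30, 3))    # made-up test design matrix
params = np.array([0.5, -1.0, 2.0])  # made-up coefficients, one per column

# Intended use: coefficients first, then the data to predict on.
probs = model_predict(params, X_test)

# The mistake from the question: the test data lands in the params slot,
# so NumPy tries to multiply a (70, 3) matrix by a (30, 3) matrix.
try:
    model_predict(X_test, X_train)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(probs.shape)
print(error_message)
```

The second call raises a ValueError because the inner dimensions do not match, which is the same kind of failure reported as "matrices are not aligned" by the older NumPy version in the question.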

What you want is the predict method of the results instance. In your example

y_pred = result.predict(X_test[subset])

should give the correct results. It applies the estimated parameters to the explanatory variables in your new test data, X_test.

Calling model.fit() returns an instance of a results class that provides access to additional post-estimation statistics and analysis, and to prediction.

Pratt answered 13/4, 2014 at 23:6 Comment(0)
