I am working on a logistic regression model and I am having trouble understanding how to apply the model fitted on my training set to my testing set. Sorry, I am new to Python and VERY new to statsmodels.
import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation
independent_vars = phy_train.columns[3:]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(phy_train[independent_vars], phy_train['target'], test_size=0.3, random_state=0)
X_train = pd.DataFrame(X_train)
X_train.columns = independent_vars
X_test = pd.DataFrame(X_test)
X_test.columns = independent_vars
y_train = pd.DataFrame(y_train)
y_train.columns = ['target']
y_test = pd.DataFrame(y_test)
y_test.columns = ['target']
logit = sm.Logit(y_train,X_train[subset],missing='drop')
result = logit.fit()
print result.summary()
y_pred = logit.predict(X_test[subset])
From the last line, I get this error:
y_pred = logit.predict(X_test[subset])
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\statsmodels\discrete\discrete_model.py", line 378, in predict
    return self.cdf(np.dot(exog, params))
ValueError: matrices are not aligned
My training and testing data sets have the same number of variables, so I am sure I am misunderstanding what logit.predict() is actually doing.
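One possible reading of the traceback, sketched with NumPy only: the failing line is cdf(np.dot(exog, params)), and as far as I can tell the unfitted model's predict takes the coefficients as its first argument, so X_test[subset] may be landing in the params slot and getting multiplied against the training data. The function model_predict and the arrays below are hypothetical stand-ins, not statsmodels code; if this reading is right, calling result.predict(X_test[subset]) on the fitted results object (which supplies its own params) would be the thing to try.

```python
import numpy as np

def logistic_cdf(z):
    # Logistic function, mapping any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def model_predict(params, exog):
    # Simplified stand-in for the line shown in the traceback:
    # return self.cdf(np.dot(exog, params))
    return logistic_cdf(np.dot(exog, params))

rng = np.random.RandomState(0)
X_train = rng.normal(size=(70, 3))    # hypothetical training design matrix
X_test = rng.normal(size=(30, 3))     # hypothetical test design matrix
params = np.array([0.5, -1.0, 2.0])   # hypothetical coefficients, one per column

# Test data passed where the coefficients belong:
# a (70, 3) array dotted with a (30, 3) array cannot be aligned.
try:
    model_predict(X_test, X_train)
    misaligned = False
except ValueError:
    misaligned = True

# Coefficients and test data in their proper roles:
# one predicted probability per test row.
y_pred = model_predict(params, X_test)
```

Note that both arrays have three columns here, so the error appears even though the number of variables matches.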
Do np.asarray(X_train[subset]).shape and np.asarray(X_test[subset]).shape have the same second value? – Pratt