am trying to run logit regression for german credit data (www4.stat.ncsu.edu/~boos/var.select/german.credit.html). To test the code, I have used only numerical variables and tried regressing it with the result using the following code.
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("germandata.txt",delimiter=' ')
df.columns = ["chk_acc","duration","history","purpose","amount","savings_acc","employ_since","install_rate","pers_status","debtors","residence_since","property","age","other_plans","housing","existing_credit","job","no_people_liab","telephone","foreign_worker","admit"]
#pls note that I am only retaining numeric variables
cols_to_keep = ['admit','duration', 'amount', 'install_rate','residence_since','age','existing_credit','no_people_liab']
# rank of cols_to_keep is 8
print np.linalg.matrix_rank(df[cols_to_keep].values)
data = df[cols_to_keep]
data['intercept'] = 1.0
train_cols = data.columns[1:]
#to check the rank of train_cols, which in this case is 8
print np.linalg.matrix_rank(data[train_cols].values)
#fit logit model
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()
All the 8.0 columns seem independent when I check the data. Inspite of this I am getting Singular Matrix Error. Can you please help?
Thanks