Logit regression and singular matrix error in Python

I am trying to run a logit regression on the German credit data (www4.stat.ncsu.edu/~boos/var.select/german.credit.html). To test the code, I have kept only the numerical variables and regressed them on the outcome using the following code.

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

df = pd.read_csv("germandata.txt",delimiter=' ')
df.columns = ["chk_acc","duration","history","purpose","amount","savings_acc","employ_since","install_rate","pers_status","debtors","residence_since","property","age","other_plans","housing","existing_credit","job","no_people_liab","telephone","foreign_worker","admit"]

#pls note that I am only retaining numeric variables
cols_to_keep = ['admit','duration', 'amount', 'install_rate','residence_since','age','existing_credit','no_people_liab']

# rank of cols_to_keep is 8
print np.linalg.matrix_rank(df[cols_to_keep].values)
data = df[cols_to_keep]

data['intercept'] = 1.0

train_cols = data.columns[1:]

#to check the rank of train_cols, which in this case is 8
print np.linalg.matrix_rank(data[train_cols].values)

#fit logit model
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()

All 8 columns appear to be independent when I check the data. In spite of this I am getting a Singular Matrix error. Can you please help?

Thanks

Daphie asked 20/12, 2013 at 12:32 Comment(0)

The endog (y) variable needs to be zero/one. In this dataset it takes the values 1 and 2. If we subtract one, the fit produces results.

>>> logit = sm.Logit(data['admit'] - 1, data[train_cols])
>>> result = logit.fit()
>>> print result.summary()
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  admit   No. Observations:                  999
Model:                          Logit   Df Residuals:                      991
Method:                           MLE   Df Model:                            7
Date:                Fri, 19 Sep 2014   Pseudo R-squ.:                 0.05146
Time:                        10:06:06   Log-Likelihood:                -579.09
converged:                       True   LL-Null:                       -610.51
                                        LLR p-value:                 4.103e-11
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
duration            0.0261      0.008      3.392      0.001         0.011     0.041
amount           7.062e-05    3.4e-05      2.075      0.038      3.92e-06     0.000
install_rate        0.2039      0.073      2.812      0.005         0.062     0.346
residence_since     0.0411      0.067      0.614      0.539        -0.090     0.172
age                -0.0213      0.007     -2.997      0.003        -0.035    -0.007
existing_credit    -0.1560      0.130     -1.196      0.232        -0.412     0.100
no_people_liab      0.1264      0.201      0.628      0.530        -0.268     0.521
intercept          -1.5746      0.430     -3.661      0.000        -2.418    -0.732
===================================================================================

However, in other cases the Hessian may not be positive definite when we evaluate it far away from the optimum, for example at bad starting values. Switching to an optimizer that does not use the Hessian often succeeds in those cases. For example, scipy's 'bfgs' is a good optimizer that works in many cases:

result = logit.fit(method='bfgs')
Worriment answered 19/9, 2014 at 14:18 Comment(2)
This has been fixed to give a good error message. github.com/statsmodels/statsmodels/pull/1978 – Cathepsin
Great answer. One other thing to check: if your exog variables are all zero (as was my case), it will cause this error as well. – Saccharo
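
A quick check along those lines (a sketch; it assumes the data and train_cols names from the question already exist):

X = data[train_cols]

# Columns that are identically zero make the design matrix rank deficient.
zero_cols = [c for c in X.columns if (X[c] == 0).all()]
print("all-zero columns:", zero_cols)

# Drop them before fitting.
X = X.drop(columns=zero_cols)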

I managed to solve this by also removing low-variance columns:

from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold=0.5):
    # https://mcmap.net/q/568569/-retain-feature-names-after-scikit-feature-selection
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

# min_variance = .9 * (1 - .9)  # You can play here with different values.
min_variance = 0.0001
low_variance = variance_threshold_selector(df, min_variance)
print('columns removed:')
print(df.columns.difference(low_variance.columns))
print(df.shape, low_variance.shape)
X = low_variance
# (Logit(y_train, X), logit.fit()... etc)

To give a bit more context: I one-hot encoded some categorical data prior to this step, and some of the resulting columns had very few 1's.

Dissimilarity answered 3/6, 2020 at 1:16 Comment(1)
Thanks for the comments. I think doing this solves the issue only by chance. The singular matrix arises when the design matrix (for example, categorical data after one-hot encoding) has columns that add up to a column of ones (the intercept). – Goode
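
One way to avoid that collinearity with the intercept (a sketch on a made-up frame, not the question's data) is to drop one reference level per variable when one-hot encoding:

import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
cats = pd.DataFrame({"housing": ["own", "rent", "own", "free"],
                     "purpose": ["car", "tv", "car", "car"]})

# drop_first=True drops one reference level per variable, so the remaining
# dummy columns can no longer add up to a constant column of ones.
dummies = pd.get_dummies(cats, drop_first=True)
print(dummies)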

This may help someone who is a noob like me!

Make sure that you are not including the target along with the predictors. I accidentally included the target among the predictors and struggled with this for a long time over such a silly mistake.

Explanation: since the target you included along with the predictors is perfectly correlated with itself, it produces a singular matrix error.
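
A minimal way to guard against that (a sketch; it assumes a numeric DataFrame df whose 0/1 target sits in a column named 'admit', as in the question):

import statsmodels.api as sm

# Build the predictor matrix by explicitly dropping the target column,
# so the dependent variable never ends up on the right-hand side.
y = df['admit']
X = df.drop(columns=['admit'])
X = sm.add_constant(X)   # add the intercept explicitly

logit = sm.Logit(y, X)
result = logit.fit()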

Marika answered 3/3, 2023 at 18:1 Comment(0)
