Statsmodels logistic regression convergence problems
I'm trying to fit a logistic regression in statsmodels on a large design matrix (~200 columns). The features include a number of interactions, categorical features, and semi-sparse (70%) integer features. Although the design matrix is not actually ill-conditioned, it seems to be close: according to numpy.linalg.matrix_rank, it is full-rank with tol=1e-3 but not with tol=1e-2. As a result, I'm struggling to get the logistic regression to converge with any of the methods in statsmodels. Here's what I've tried so far:

  • method='newton': Did not converge after 1000 iterations, then raised a singular-matrix LinAlgError while trying to invert the Hessian.

  • method='bfgs': Warned of possible precision loss and claimed convergence after 0 iterations; it obviously had not actually converged.

  • method='nm': Claimed convergence, but the model had a negative pseudo-R-squared and many coefficients were still zero (and very different from the values they had converged to in better-conditioned submodels). Cranking xtol down to 1e-8 didn't help.

  • fit_regularized(method='l1'): Reported "Inequality constraints incompatible" (exit mode 4), then raised a singular-matrix LinAlgError while trying to compute the restricted Hessian inverse.
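For reference, the tolerance-dependent rank check described above can be reproduced with NumPy alone. The matrix below is synthetic (the real design matrix is proprietary), built so that one column is nearly, but not exactly, a linear combination of the others:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))

# Append a column that is almost a linear combination of the existing
# ones, mimicking a near-rank-deficient design matrix.
X = np.column_stack([X, X @ np.ones(p) + 1e-4 * rng.normal(size=n)])

# matrix_rank treats singular values below `tol` as zero, so the
# reported rank depends on the tolerance, just as in the question.
print(np.linalg.matrix_rank(X, tol=1e-3))  # full rank (6)
print(np.linalg.matrix_rank(X, tol=1e-2))  # rank-deficient (5)
print(np.linalg.cond(X))                   # large condition number
```

A large condition number like this is exactly the regime where Newton-type solvers start failing on near-singular Hessians.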

Treasury asked 11/12, 2014 at 1:24
Can you share your data somewhere? – Advanced
Alas, no; it's proprietary. – Treasury
I found that standardizing the data helped with the convergence issues. It's an acceptable workaround: I can't use formulas with it (centering each column inside a formula is a pain), and it makes the coefficients harder to interpret, but it at least gets the model to converge. – Treasury
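A minimal sketch of that standardization step in plain NumPy, assuming constant columns (such as an intercept) should be left untouched so we don't divide by zero:

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale to standard deviation 1.

    Constant columns (e.g. an intercept) are returned unchanged, since
    centering would zero them out and scaling would divide by zero.
    """
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0)
    varying = sd > 0
    Z = X.copy()
    Z[:, varying] = (X[:, varying] - X[:, varying].mean(axis=0)) / sd[varying]
    return Z

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(5.0, 2.0, size=(100, 3))])
Z = standardize(X)
```

A coefficient fitted on a standardized column measures the effect of a one-standard-deviation change; dividing it by the column's original standard deviation recovers the coefficient on the raw scale.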
The above "solution" still failed once I added an 18-level categorical feature. There's some chance this was due to actual collinearity, although I doubt it. I'll try to create example (random) data that exhibits the problem tomorrow. – Treasury
Not wholly surprised. We have some code to do this internally, but it's not hooked up by default yet. – Advanced
Ah, excellent! That would make life a lot easier. Thanks for your excellent work on statsmodels; I know I'm asking a lot of it! – Treasury
Do you use the parameters of the smaller model as starting values for the larger model when you add variables/terms? – Fascista
It would be helpful if you have a ready example that fails and that standardization makes work. We could add it to the documentation. – Advanced
I tried to reproduce the problem with publicly available code and data. I didn't get all the way there in the time I allotted, but I at least managed to reproduce some parts of it here. Namely, I found that with a spline basis the matrix appeared full-rank only at very low tolerances, and that logistic regression appeared to converge but gave meaningless confidence intervals and p-values. However, standardization didn't make this one work any better. – Treasury
A similar question on Cross Validated. – Jarvisjary
