In the iris example we can perfectly predict Setosa. This causes problems with (partial) perfect separation in Logit and MNLogit.
Perfect separation is good for prediction, but the parameters of logit go to infinity. In this case I get a Singular Matrix error instead of Nans with a relatively recent version of statsmodels master (on Windows).
The default optimizer for the discrete models is Newton which fails when the Hessian becomes singular. Other optimizers that don't use the information from the Hessian are able to finish the optimization. For example using 'bfgs', I get
>>> mdl_fit = mdl.fit(method='bfgs')
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.057112
Iterations: 35
Function evaluations: 37
Gradient evaluations: 37
e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\model.py:471: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
The predicted probabilities for Setosa are essentially (1, 0, 0), that is they are perfectly predicted
>>> fitted = mdl_fit.predict()
>>> fitted[y=='setosa'].min(0)
array([ 9.99497636e-01, 2.07389867e-11, 1.71740822e-38])
>>> fitted[y=='setosa'].max(0)
array([ 1.00000000e+00, 5.02363854e-04, 1.05778255e-20])
However, because of perfect separation the parameters are not identified, the values are determined mostly by the stopping criterion of the optimizer and the standard errors are very large.
>>> print(mdl_fit.summary())
MNLogit Regression Results
==============================================================================
Dep. Variable: Species No. Observations: 150
Model: MNLogit Df Residuals: 140
Method: MLE Df Model: 8
Date: Mon, 20 Jul 2015 Pseudo R-squ.: 0.9480
Time: 04:08:04 Log-Likelihood: -8.5668
converged: False LL-Null: -164.79
LLR p-value: 9.200e-63
=====================================================================================
Species=versicolor coef std err z P>|z| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length -1.4959 444.817 -0.003 0.997 -873.321 870.330
Sepal.Width -8.0560 282.766 -0.028 0.977 -562.267 546.155
Petal.Length 11.9301 374.116 0.032 0.975 -721.323 745.184
Petal.Width 1.7039 759.366 0.002 0.998 -1486.627 1490.035
const 1.6444 1550.515 0.001 0.999 -3037.309 3040.597
--------------------------------------------------------------------------------------
Species=virginica coef std err z P>|z| [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length -8.0348 444.835 -0.018 0.986 -879.896 863.827
Sepal.Width -15.8195 282.793 -0.056 0.955 -570.083 538.444
Petal.Length 22.1797 374.155 0.059 0.953 -711.152 755.511
Petal.Width 14.0603 759.384 0.019 0.985 -1474.304 1502.425
const -6.5053 1550.533 -0.004 0.997 -3045.494 3032.483
=====================================================================================
About the implementation in statsmodels
Logit checks specifically for perfect separation and raises an Exception that can optionally be weakened to a Warning.
For other models like MNLogit, there is not yet an explicit check for perfect separation, largely for the lack of good test cases and easily identifiable general conditions. (several issues like https://github.com/statsmodels/statsmodels/issues/516 are still open)
My strategy in general:
When there is a convergence failure, then try different optimizers and different starting values (start_params
). If some optimizers succeed, then it might be a difficult optimization problem, either with the curvature of the objective function, badly scaled explanatory variables or similar. A useful check is to use the parameter estimates of robust optimizers like nm
or powell
as starting values for the optimizers that are more strict, like newton
or bfgs
.
If the results are still not good after convergence of some optimizers, then it might be an inherent problem with the data like perfect separation in Logit, Probit and several other models or a singular or near singular design matrix. In that case the model has to be changed. Recommendation for perfect separation can be found with a internet search.