scikit-learn: how to check coefficient significance

I tried to fit a logistic regression with scikit-learn on a rather large dataset with ~600 dummy variables and only a few interval variables (and 300K rows), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and do an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!

Tapster answered 4/8, 2014 at 16:9 Comment(7)
If your logistic regression object is called lr, try looking at lr.coef_. Is this what you are looking for?Angelenaangeleno
No, coef_ gives the coefficient values; what I want is the significance of those values: the z-score and the p-value. That is, you assume a null hypothesis that the coefficient is 0 (H_0: coefficient = 0) and an alternative hypothesis H_1: coefficient != 0, and the p-value basically tells you whether you can reject H_0 (when the p-value is tiny) or not (when the p-value is close to 1).Tapster
With logistic regression I have the feeling that you can only get those using resampling and building empirical distributions on the coef_ of each sample.Angelenaangeleno
Well, yes, but I was wondering if there is a built-in method in sklearn, like summary() for a glm object in R...Tapster
I'm sorry, there isn't one. If this feature exists in R, why don't you use R? In any case, this type of classic statistic for e.g. linear regression only works in the n_features <= n_samples setting.Angelenaangeleno
I actually did what you suggested and built empirical distributions on sub-samples - worked nicely! Thank you! (A minimal sketch of this approach follows these comments.)Tapster
If that was helpful you may also check stability selection and the sklearn implementations of randomized logistic regression. These can provide you with a stable selection of features.Angelenaangeleno
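
A minimal sketch of the resampling idea from these comments, under assumptions of mine: the function name, the number of resamples, max_iter, and the percentile-interval rule are illustrative and not part of the original discussion.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def bootstrap_coefs(X, y, n_boot=200, random_state=0):
    """Refit on bootstrap resamples; return an (n_boot, n_features) array of coefficients."""
    rng = np.random.RandomState(random_state)
    coefs = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)   # sample rows with replacement
        coefs.append(LogisticRegression(max_iter=1000).fit(Xb, yb).coef_[0])
    return np.array(coefs)

# A coefficient whose empirical 2.5%-97.5% interval excludes zero can be
# treated as significant in this resampling sense:
# boot = bootstrap_coefs(X, y)
# lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)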

Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.

If you still want to stick to scikit-learn's LogisticRegression, you can use an asymptotic approximation to the distribution of the maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the observed Fisher information (the negative Hessian of the log-likelihood) at theta. This is exactly what the function below does:
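
Spelled out (a standard result, stated here in my own notation for clarity): for logistic regression, the matrix that the loop in the code below accumulates is the Fisher information

H = \sum_{i=1}^{n} \hat{p}_i \, (1 - \hat{p}_i) \, x_i x_i^{\top},
\qquad
\widehat{\operatorname{Var}}(\hat{\theta}) \approx H^{-1},

where \hat{p}_i is the predicted probability for row i and x_i includes a leading 1 for the intercept. The standard errors are the square roots of the diagonal of H^{-1}, the z-scores are the coefficients divided by these standard errors, and the two-sided p-values come from the standard normal approximation.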

import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """ Calculate p-values for scikit-learn LogisticRegression coefficients.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x:     matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1                    # number of coefficients, including the intercept
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.insert(np.array(x), 0, 1, axis=1)  # prepend a column of ones for the intercept
    # accumulate the Fisher information matrix: sum of p_i * (1 - p_i) * x_i x_i^T
    fisher = np.zeros((m, m))
    for i in range(n):
        fisher += np.outer(x_full[i], x_full[i]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(fisher)                   # asymptotic variance-covariance matrix
    se = np.sqrt(np.diag(vcov))                    # standard errors
    z = coefs / se                                 # z-scores
    p_values = (1 - norm.cdf(np.abs(z))) * 2       # two-sided p-values
    return p_values

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()

The two print() outputs match (up to floating-point rounding), and they are the coefficient p-values:

[ 0.11413093  0.08779978]
[ 0.11413093  0.08779979]

sm_model.summary() also produces a nicely formatted summary table (rendered as HTML in a notebook).
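
A small usage note (the term names 'intercept' and 'x1' are made up for the toy example above, not part of the original answer): the p-values returned by logit_pvalue() line up with [intercept, coef_1, ..., coef_k], so they can be labeled for readability:

import pandas as pd
# label each p-value with the name of its term
pvals = logit_pvalue(model, x)
print(pd.Series(pvals, index=['intercept', 'x1']))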

Archle answered 2/11, 2017 at 15:38 Comment(10)
Thanks a lot for your answer. How reliable is the p-value, and do statsmodels and SPSS use the same approach?Marchetti
@Marchetti 1) Could you please define what you mean by "reliable p-value"? It is a special case of an MLE p-value, so I suggest looking into the theory of "asymptotic properties of maximum likelihood estimates" to get an overall understanding of its reliability.Archle
@Marchetti 2) Yes, Statsmodels does calculate p-values for logistic regression in the same way. The covariance matrix of the parameters (the statsmodels.base.model.LikelihoodModelResults.normalized_cov_params attribute) is calculated as the inverse Hessian in the statsmodels.base.model.LikelihoodModel.fit method, and is further used for p-value estimation and other purposes. And as far as I know, SPSS does essentially the same.Archle
That looks great. Just one more question: that method looks pretty similar to the Wald statistic, doesn't it?Marchetti
Yes, this p-value is exactly the significance of the Wald test. Both are based on the assumption that (estimate - hypothesized value) / std.dev(estimate) is asymptotically standard normal if the hypothesis is true. See en.wikipedia.org/wiki/Wald_test#Test_on_a_single_parameterArchle
@DavidDale do you have a reference on why sklearn does not support statistical inference?Idyllic
@Idyllic because of focus. They just cannot support everything (with very limited resources), and they choose to cover different ML algorithms more fully, instead of doing other things.Archle
@David Dale Do you make the point about the C term because you don't want a model that will overfit, or because C has a strong influence on the p-value?Lucho
C has a strong influence on the coefficients themselves (and through them on the p-values, of course). Scikit-learn uses C=1 by default; Statsmodels doesn't regularize at all (which is equivalent to C=infinity). Therefore, if we want scikit-learn and statsmodels to have similar coefficients, we need to set C very high in scikit-learn. (A small illustration follows these comments.)Archle
Can we still use this function definition even if we have a different C, say, C = 0.5? Does the Wald test result differ in this case?Monodic
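
A small illustration of the point about C above (this comparison is my addition, not part of the original thread; the exact numbers depend on the data):

# Illustration: how C affects the fitted coefficients (and, through them, the p-values).
# Uses the same toy x, y as in the answer above.
for C in (0.5, 1.0, 1e30):
    m = LogisticRegression(C=C).fit(x, y)
    print(C, m.intercept_, m.coef_[0])
# Small C shrinks the coefficients towards zero, so the estimates are no longer plain
# maximum-likelihood estimates and the Wald-type p-values from logit_pvalue() no longer
# strictly follow the asymptotic theory; setting C very large (or penalty=None in recent
# scikit-learn versions) matches statsmodels' unregularized fit.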
