scikit-learn: how to check coefficient significance

I tried to fit a logistic regression with scikit-learn on a rather large dataset with ~600 dummy variables and only a few interval variables (and 300K rows), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and do an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!

Tapster answered 4/8, 2014 at 16:9 Comment(7)
If your logistic regression object is called lr, try looking at lr.coef_. Is this what you are looking for?Angelenaangeleno
No, coef_ gives the coefficient values; what I want is the significance of those values: the z-score and the p-value. That is, you assume a null hypothesis that the coefficient is 0 (H_0: coefficient = 0) and an alternative hypothesis H_1: coefficient != 0, and the p-value basically tells you whether you can reject H_0 (when the p-value is tiny) or not (when the p-value is close to 1).Tapster
With logistic regression I have the feeling that you can only get those using resampling and building empirical distributions on the coef_ of each sample.Angelenaangeleno
Well, yes, but I was wondering if there is a built-in method in sklearn, like summary() for a glm object in R...Tapster
I'm sorry, there isn't one. If this feature exists in R, why don't you use R? In any case, this type of classic statistic for e.g. linear regression only works in the n_features <= n_samples setting.Angelenaangeleno
I actually did what you suggested and built empirical distributions on sub-samples - worked nicely! Thank you! (A minimal sketch of this approach follows these comments.)Tapster
If that was helpful you may also check stability selection and the sklearn implementations of randomized logistic regression. These can provide you with a stable selection of features.Angelenaangeleno
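
A minimal sketch of the resampling idea from these comments, under assumptions of mine: the function name, the number of resamples, max_iter, and the percentile-interval rule are illustrative and not part of the original discussion.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def bootstrap_coefs(X, y, n_boot=200, random_state=0):
    """Refit on bootstrap resamples; return an (n_boot, n_features) array of coefficients."""
    rng = np.random.RandomState(random_state)
    coefs = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)   # sample rows with replacement
        coefs.append(LogisticRegression(max_iter=1000).fit(Xb, yb).coef_[0])
    return np.array(coefs)

# A coefficient whose empirical 2.5%-97.5% interval excludes zero can be
# treated as significant in this resampling sense:
# boot = bootstrap_coefs(X, y)
# lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)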

Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.

If you still want to stick to scikit-learn's LogisticRegression, you can use an asymptotic approximation to the distribution of the maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the observed Fisher information (the negative Hessian of the log-likelihood) at theta. This is exactly what the function below does:
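
Spelled out (a standard result, stated here in my own notation for clarity): for logistic regression, the matrix that the loop in the code below accumulates is the Fisher information

H = \sum_{i=1}^{n} \hat{p}_i \, (1 - \hat{p}_i) \, x_i x_i^{\top},
\qquad
\widehat{\operatorname{Var}}(\hat{\theta}) \approx H^{-1},

where \hat{p}_i is the predicted probability for row i and x_i includes a leading 1 for the intercept. The standard errors are the square roots of the diagonal of H^{-1}, the z-scores are the coefficients divided by these standard errors, and the two-sided p-values come from the standard normal approximation.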

import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """ Calculate p-values for scikit-learn LogisticRegression coefficients.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x:     matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1                    # number of coefficients, including the intercept
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.insert(np.array(x), 0, 1, axis=1)  # prepend a column of ones for the intercept
    # accumulate the Fisher information matrix: sum of p_i * (1 - p_i) * x_i x_i^T
    fisher = np.zeros((m, m))
    for i in range(n):
        fisher += np.outer(x_full[i], x_full[i]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(fisher)                   # asymptotic variance-covariance matrix
    se = np.sqrt(np.diag(vcov))                    # standard errors
    z = coefs / se                                 # z-scores
    p_values = (1 - norm.cdf(np.abs(z))) * 2       # two-sided p-values
    return p_values

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()

The two print() outputs match (up to floating-point rounding), and they are the coefficient p-values:

[ 0.11413093  0.08779978]
[ 0.11413093  0.08779979]

sm_model.summary() also produces a nicely formatted summary table (rendered as HTML in a notebook).
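
A small usage note (the term names 'intercept' and 'x1' are made up for the toy example above, not part of the original answer): the p-values returned by logit_pvalue() line up with [intercept, coef_1, ..., coef_k], so they can be labeled for readability:

import pandas as pd
# label each p-value with the name of its term
pvals = logit_pvalue(model, x)
print(pd.Series(pvals, index=['intercept', 'x1']))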

Archle answered 2/11, 2017 at 15:38 Comment(10)
Thanks a lot for your answer. How reliable is the p-value, and do statsmodels and SPSS use the same approach?Marchetti
@Marchetti 1) Could you please define what you mean by "reliable p-value"? It is a special case of an MLE p-value, so I suggest looking into the theory of "asymptotic properties of maximum likelihood estimates" to get an overall understanding of its reliability.Archle
@Marchetti 2) Yes, Statsmodels does calculate p-values for logistic regression in the same way. The covariance matrix of the parameters (the statsmodels.base.model.LikelihoodModelResults.normalized_cov_params attribute) is calculated as the inverse Hessian in the statsmodels.base.model.LikelihoodModel.fit method, and is further used for p-value estimation and other purposes. And as far as I know, SPSS does essentially the same.Archle
That looks great. Just one more question: that method looks pretty similar to the Wald statistic, doesn't it?Marchetti
Yes, this p-value is exactly the significance of the Wald test. Both are based on the assumption that (estimate - hypothesized value) / std.dev(estimate) is asymptotically standard normal if the hypothesis is true. See en.wikipedia.org/wiki/Wald_test#Test_on_a_single_parameterArchle
@DavidDale do you have a reference on why sklearn does not support statistical inference?Idyllic
@Idyllic because of focus. They just cannot support everything (with very limited resources), and they choose to cover different ML algorithms more fully, instead of doing other things.Archle
@David Dale Do you make the point about the C term because you don't want a model that will overfit, or because C has a strong influence on the p-value?Lucho
C has a strong influence on the coefficients themselves (and through them on the p-values, of course). Scikit-learn uses C=1 by default; Statsmodels doesn't regularize at all (which is equivalent to C=infinity). Therefore, if we want scikit-learn and statsmodels to have similar coefficients, we need to set C very high in scikit-learn. (A small illustration follows these comments.)Archle
Can we still use this function definition even if we have a different C, say, C = 0.5? Does the Wald test result differ in this case?Monodic
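
A small illustration of the point about C above (this comparison is my addition, not part of the original thread; the exact numbers depend on the data):

# Illustration: how C affects the fitted coefficients (and, through them, the p-values).
# Uses the same toy x, y as in the answer above.
for C in (0.5, 1.0, 1e30):
    m = LogisticRegression(C=C).fit(x, y)
    print(C, m.intercept_, m.coef_[0])
# Small C shrinks the coefficients towards zero, so the estimates are no longer plain
# maximum-likelihood estimates and the Wald-type p-values from logit_pvalue() no longer
# strictly follow the asymptotic theory; setting C very large (or penalty=None in recent
# scikit-learn versions) matches statsmodels' unregularized fit.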
