Scikit-Learn: Std.Error, p-Value from LinearRegression

I've been trying to get the standard errors and p-values for a LinearRegression model from scikit-learn, but with no success.

I eventually found this article, but the standard errors and p-values it produces don't match those from the statsmodels.api OLS method:

import numpy as np
from sklearn import datasets
from sklearn import linear_model
from regressors import stats  # https://github.com/nsh87/regressors
import statsmodels.api as sm


boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False
X = boston.data[:,which_betas]
y = boston.target

# scikit-learn + regressors stats
ols = linear_model.LinearRegression()
ols.fit(X, y)

xlabels = boston.feature_names[which_betas]
stats.summary(ols, X, y, xlabels)


# statsmodels
x2 = sm.add_constant(X)
models = sm.OLS(y, x2)
result = models.fit()
print(result.summary())

The output is as follows:

Residuals:
Min      1Q  Median      3Q      Max
-26.3743 -1.9207  0.6648  2.8112  13.3794


Coefficients:
             Estimate  Std. Error  t value   p value
_intercept  36.925033    4.915647   7.5117  0.000000
CRIM        -0.112227    0.031583  -3.5534  0.000416
ZN           0.047025    0.010705   4.3927  0.000014
INDUS        0.040644    0.055844   0.7278  0.467065
NOX        -17.396989    3.591927  -4.8434  0.000002
RM           3.845179    0.272990  14.0854  0.000000
AGE          0.002847    0.009629   0.2957  0.767610
DIS         -1.485557    0.180530  -8.2289  0.000000
RAD          0.327895    0.061569   5.3257  0.000000
TAX         -0.013751    0.001055 -13.0395  0.000000
PTRATIO     -0.991733    0.088994 -11.1438  0.000000
B            0.009827    0.001126   8.7256  0.000000
LSTAT       -0.534914    0.042128 -12.6973  0.000000
---
R-squared:  0.73547,    Adjusted R-squared:  0.72904
F-statistic: 114.23 on 12 features
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.735
Model:                            OLS   Adj. R-squared:                  0.729
Method:                 Least Squares   F-statistic:                     114.2
Date:                Sun, 21 Aug 2016   Prob (F-statistic):          7.59e-134
Time:                        21:56:26   Log-Likelihood:                -1503.8
No. Observations:                 506   AIC:                             3034.
Df Residuals:                     493   BIC:                             3089.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.9250      5.148      7.173      0.000        26.811    47.039
x1            -0.1122      0.033     -3.405      0.001        -0.177    -0.047
x2             0.0470      0.014      3.396      0.001         0.020     0.074
x3             0.0406      0.062      0.659      0.510        -0.081     0.162
x4           -17.3970      3.852     -4.516      0.000       -24.966    -9.828
x5             3.8452      0.421      9.123      0.000         3.017     4.673
x6             0.0028      0.013      0.214      0.831        -0.023     0.029
x7            -1.4856      0.201     -7.383      0.000        -1.881    -1.090
x8             0.3279      0.067      4.928      0.000         0.197     0.459
x9            -0.0138      0.004     -3.651      0.000        -0.021    -0.006
x10           -0.9917      0.131     -7.547      0.000        -1.250    -0.734
x11            0.0098      0.003      3.635      0.000         0.005     0.015
x12           -0.5349      0.051    -10.479      0.000        -0.635    -0.435
==============================================================================
Omnibus:                      190.837   Durbin-Watson:                   1.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              897.143
Skew:                           1.619   Prob(JB):                    1.54e-195
Kurtosis:                       8.663   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

I've also found the following articles, but neither of the code snippets in the SO links runs.

Here are the code and data I'm working with; I still haven't been able to compute the standard errors and p-values:

import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy
from sklearn.linear_model import LinearRegression
from sklearn import metrics 


def readFile(filename, sheetname):
    xlsx = pd.ExcelFile(filename)
    data = xlsx.parse(sheetname, skiprows=1)
    return data


def lr_statsmodel(X,y):
    X = sm.add_constant(X)
    model = sm.OLS(y,X)
    results = model.fit()
    print (results.summary())


def lr_scikit(X, y, featureCols):
    model = LinearRegression()
    results = model.fit(X, y)

    predictions = results.predict(X)

    print('Coefficients')
    print('Intercept\t', results.intercept_)
    df = pd.DataFrame(list(zip(featureCols, results.coef_)))
    print(df.to_string(index=False, header=False))

    # Query: the numbers match the Excel OLS output, but I'm skeptical about treating score() as R-squared
    rSquare = results.score(X, y)
    print('\nR-Square::', rSquare)

    # This looks like a better option
    # source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    r2 = metrics.r2_score(y, predictions)
    print('r2', r2)

    # Query: no clue at all! http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
    print('Rsquared?!', metrics.explained_variance_score(y, predictions))
    # INFO: all three of them give the same figures!

    # Adjusted R-squared formula: https://www.easycalculation.com/statistics/learn-adjustedr2.php
    # In ML we don't usually train on all of the data, so adjusted R-squared is rarely reported; hence the manual calculation
    N = X.shape[0]
    p = X.shape[1]
    adjRsquare = 1 - ((1 - rSquare) * (N - 1) / (N - p - 1))
    print('Adjusted R-Square::', adjRsquare)

    # calculate standard errors
    # (available regression metrics: mean_absolute_error, mean_squared_error,
    #  median_absolute_error, r2_score, explained_variance_score)
    mse = metrics.mean_squared_error(y, predictions)
    print(mse)
    print('Residual Standard Error:', np.sqrt(mse))

    # OLS in matrix form: https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), np.matrix(X)))
    se_matrix = scipy.linalg.sqrtm(
        metrics.mean_squared_error(y, predictions) *
        np.linalg.inv(X1.T * X1)
    )
    print('se', np.diagonal(se_matrix))

    # https://github.com/nsh87/regressors/blob/master/regressors/stats.py
    # http://regressors.readthedocs.io/en/latest/usage.html

    y_hat = results.predict(X)
    sse = np.sum((y_hat - y) ** 2)
    print('Sum of Squared Errors of the Model:', sse)




if __name__ == '__main__':

    # read file 
    fileData = readFile('Linear_regression.xlsx','Input Data')

    # list of independent variables 
    feature_cols = ['Price per week','Population of city','Monthly income of riders','Average parking rates per month']

    # build dependent & independent data set 
    X = fileData[feature_cols]
    y = fileData['Number of weekly riders']

    # Statsmodel - OLS 
#    lr_statsmodel(X,y)

    # ScikitLearn - OLS 
    lr_scikit(X,y,feature_cols)

My dataset:

City    Number of weekly riders (Y)    Price per week (X1)    Population of city (X2)    Monthly income of riders (X3)    Average parking rates per month (X4)
1   1,92,000    $15     18,00,000   $5,800  $50
2   1,90,400    $15     17,90,000   $6,200  $50
3   1,91,200    $15     17,80,000   $6,400  $60
4   1,77,600    $25     17,78,000   $6,500  $60
5   1,76,800    $25     17,50,000   $6,550  $60
6   1,78,400    $25     17,40,000   $6,580  $70
7   1,80,800    $25     17,25,000   $8,200  $75
8   1,75,200    $30     17,25,000   $8,600  $75
9   1,74,400    $30     17,20,000   $8,800  $75
10  1,73,920    $30     17,05,000   $9,200  $80
11  1,72,800    $30     17,10,000   $9,630  $80
12  1,63,200    $40     17,00,000   $10,570 $80
13  1,61,600    $40     16,95,000   $11,330 $85
14  1,61,600    $40     16,95,000   $11,600 $100
15  1,60,800    $40     16,90,000   $11,800 $105
16  1,59,200    $40     16,30,000   $11,830 $105
17  1,48,800    $65     16,40,000   $12,650 $105
18  1,15,696    $102    16,35,000   $13,000 $110
19  1,47,200    $75     16,30,000   $13,224 $125
20  1,50,400    $75     16,20,000   $13,766 $130
21  1,52,000    $75     16,15,000   $14,010 $150
22  1,36,000    $80     16,05,000   $14,468 $155
23  1,26,240    $86     15,90,000   $15,000 $165
24  1,23,888    $98     15,95,000   $15,200 $175
25  1,26,080    $87     15,90,000   $15,600 $175
26  1,51,680    $77     16,00,000   $16,000 $190
27  1,52,800    $63     16,10,000   $16,200 $200

I've exhausted all my options and everything I could make sense of, so any guidance on how to compute standard errors and p-values that match the statsmodels.api results is appreciated.

EDIT: I'm trying to find the standard errors and p-values for the intercept and all the independent variables.

Vickeyvicki asked 21/8, 2016 at 16:48 · Comments (4):
Were you able to understand the difference here? Could this be because you are using a package outside of sklearn? You are using regressors for computations on top of the sklearn results, and it may compute things differently than statsmodels does. — Streptococcus
Old thread, but I also encountered the same problem. regressors uses the built-in method sklearn.metrics.mean_squared_error to compute the MSE (which is then used to compute the p-values), but this uses a divisor of n instead of n - p, where n is the sample size and p is the number of features. This causes a discrepancy with statsmodels in cases where p is not small compared to n; see the sketch below these comments. — Gwinn
@Gwinn What are the scenarios where one method of calculation would be better than the other? — Spinozism
@DonQuixote statsmodels works better, since it uses the divisor n - p to give an unbiased estimate. If n is very large, you should get comparable results. — Gwinn
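
For reference, here is a minimal sketch of how the statsmodels figures can be reproduced from a fitted sklearn model once the unbiased divisor n - p - 1 is used. This is an illustration, not code from the thread: it uses all 13 Boston features for simplicity and assumes an sklearn version that still ships load_boston.

import numpy as np
from scipy import stats
from sklearn import datasets, linear_model

boston = datasets.load_boston()
X, y = boston.data, boston.target
n, p = X.shape

ols = linear_model.LinearRegression().fit(X, y)

X1 = np.hstack([np.ones((n, 1)), X])        # design matrix with intercept column
beta = np.r_[ols.intercept_, ols.coef_]     # estimates, intercept first
resid = y - ols.predict(X)
sigma2 = resid.dot(resid) / (n - p - 1)     # unbiased residual variance (divisor n - p - 1, not n)
cov = sigma2 * np.linalg.inv(X1.T.dot(X1))  # covariance matrix of the estimates
se = np.sqrt(np.diag(cov))                  # standard errors
t_stat = beta / se
p_vals = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)

for name, b, s, pv in zip(['const'] + list(boston.feature_names), beta, se, p_vals):
    print('%10s %10.4f %8.3f %8.4f' % (name, b, s, pv))

With df = n - p - 1 these standard errors and p-values should agree with the statsmodels summary above, whereas dividing the SSE by n reproduces the regressors output, per the comment above.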

Here, reg is the output of sklearn's linear regression fit method.

To calculate adjusted R²:

def adjustedR2(x, y, reg):
    r2 = reg.score(x, y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adjusted_r2
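
For example, a quick usage sketch on synthetic data (the data below is made up purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.rand(100, 4)                               # 100 samples, 4 features
y = x.dot([1.0, -2.0, 0.5, 3.0]) + rng.randn(100) * 0.1

reg = LinearRegression().fit(x, y)
print(adjustedR2(x, y, reg))                       # slightly below reg.score(x, y)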

And for p-values:

from sklearn.feature_selection import f_regression

freg = f_regression(x, y)

p = freg[1]

print(p.round(3))
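
One caveat (an addition to this answer): f_regression runs a univariate test of each feature against y on its own, so its p-values generally won't match the joint OLS p-values from statsmodels. A small sketch with two collinear features shows the difference:

import numpy as np
import statsmodels.api as sm
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + rng.randn(200) * 0.1                      # nearly a copy of x1
y = x1 + rng.randn(200) * 0.5
X = np.c_[x1, x2]

print(f_regression(X, y)[1])                        # both p-values tiny: each feature alone predicts y
print(sm.OLS(y, sm.add_constant(X)).fit().pvalues)  # joint p-values are far larger here due to collinearity
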
Masorete answered 23/5, 2021 at 1:37 · Comments (0)
