Standard errors for multivariate regression coefficients

Asked 5/1, 2014 at 19:34 Answered 28/9, 2017 at 3:45

Solved python scikit-learn linear-regression

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:

    import numpy as np
    from sklearn import linear_model
    clf = linear_model.LinearRegression()
    TST = np.vstack([x1,x2,x3,x4])
    TST = TST.transpose()
    clf.fit (TST,y)
    clf.coef_

Now, I need the standard errors for these same coefficients. How can I do that? Thanks a lot.

Derangement answered 5/1, 2014 at 19:34 Comment(0)

Based on this stats question and wikipedia, my best guess is:

MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)

However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.

Hyrcania answered 5/1, 2014 at 21:49 Comment(3)

Here, we're using MSE as an unbiased estimate for the unknown population variance. – Wellhead 10/2, 2016 at 15:8

There is an error in your formula. When computing the mean squared error (MSE) you need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_observations - df_features. The link you provided delegates this step to the Anova method used. So if we have m = TST.shape[0] observations and n = TST.shape[1] features, we would need to compute MSE as MSE = np.sum((y - clf.predict(TST).T)**2)/(m - n) or MSE = np.sum((y - clf.predict(TST).T)**2)/(m - n - 1) if you use the intercept term. – Cowled 28/9, 2017 at 2:56

Also note that as Luca pointed out, sklearn.linear_model fits the intercept by default. So you would have to add a column of ones to TST during your computations. – Cowled 28/9, 2017 at 3:10

MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)

I guess that this answer is not entirely correct. In particular, if I am not wrong, according to your code sklearn is adding the constant term in order to compute your coefficient by default.

Then, you need to include in your matrix TST the column of ones. Then, the code is correct and it will give you an array with all the SE

Mallina answered 1/5, 2016 at 12:20 Comment(0)

These code has been tested with data. They are correct.

find the X matrix for each data set, n is the length of dataset, m is the variables number

X, n, m=arrays(data) 
y=***.reshape((n,1))
linear = linear_model.LinearRegression()
linear.fit(X, y , n_jobs=-1) ## delete n_jobs=-1, if it's one variable only.

sum square

s=np.sum((linear.predict(X) - y) ** 2)/(n-(m-1)-1)

standard deviation, square root of the diagonal of variance-co-variance matrix (sigular vector decomposition)

sd_alpha=np.sqrt(s*(np.diag(np.linalg.pinv(np.dot(X.T,X)))))

(t-statistics using, linear.intercept_ for one variable)

t_stat_alpha=linear.intercept_[0]/sd_alpha[0]  #( use linear.intercept_ for one variable_

Placet answered 9/10, 2016 at 22:49 Comment(0)

I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):

# preparation
X = np.concatenate((np.ones(TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape

# computation
MSE = np.sum((y_hat - y)**2)/(m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T,X)))
coef_SE_est = np.sqrt(var_est)

Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.

The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use

print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0,:]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i+1, coef, coef_SE_est[i+1]))

Cowled answered 28/9, 2017 at 3:45 Comment(0)

-1

The example from the documentation shows how to get the mean square error and explained variance score:

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

Does this cover what you need?

Danielldaniella answered 5/1, 2014 at 20:4 Comment(0)

find the X matrix for each data set, n is the length of dataset, m is the variables number

sum square

standard deviation, square root of the diagonal of variance-co-variance matrix (sigular vector decomposition)

(t-statistics using, linear.intercept_ for one variable)

Recommended topics

Hot tags