Linear Regression vs. Closed-Form Ordinary Least Squares in Python

I am trying to apply linear regression to a dataset of 9 samples with around 50 features using Python. I have tried different approaches, i.e. closed-form OLS (ordinary least squares), LR (linear regression), HR (Huber regression), and NNLS (non-negative least squares), and each of them gives different weights.

I can see the intuition for why HR and NNLS produce different solutions, but LR and closed-form OLS have the same objective function: minimizing the sum of the squares of the differences between the observed values in the given sample and those predicted by a linear function of a set of features. Since the training set is singular, I had to use the pseudoinverse to perform closed-form OLS.
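Concretely, both should minimize $\lVert Xw - y \rVert_2^2$, whose normal-equations solution is $w = (X^\top X)^{-1} X^\top y$; since $X^\top X$ is singular here, I replace the inverse with the Moore-Penrose pseudoinverse: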

# Gram matrix X^T X (named `w` here, though it is not yet the weight vector)
w = np.dot(train_features.T, train_features)
# Minimum-norm weights w1 = (X^T X)^+ X^T y, equivalent to np.linalg.pinv(X) @ y
w1 = np.dot(np.linalg.pinv(w), np.dot(train_features.T, train_target))

For LR I have used scikit-learn's LinearRegression, which uses the LAPACK library from www.netlib.org to solve the least-squares problem:

       from sklearn import linear_model
       lr = linear_model.LinearRegression().fit(train_features, train_target)

A system of linear equations (or of polynomial equations) is called underdetermined if the number of equations is less than the number of unknown parameters. Each unknown parameter counts as an available degree of freedom, and each equation acts as a constraint that removes one degree of freedom. Consequently, an underdetermined system has either infinitely many solutions or no solution at all. Since in our case the system is underdetermined and also singular, there exist many solutions.
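To make that concrete (a toy example of my own, not the actual dataset): pinv picks the minimum-norm solution, and adding any null-space direction of X gives another, equally exact solution.

import numpy as np
from scipy.linalg import null_space

# Toy underdetermined system: 2 equations, 3 unknowns
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([1.0, 2.0])

w_min = np.linalg.pinv(X).dot(y)             # minimum-norm solution
w_other = w_min + 5.0 * null_space(X)[:, 0]  # shift along the null space of X

print(np.allclose(X.dot(w_min), y))    # True
print(np.allclose(X.dot(w_other), y))  # True: a different exact solution
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))  # True: pinv picks the smallest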

Now both the pseudoinverse and the LAPACK routines try to find the minimum-norm solution of an underdetermined system when the number of samples is less than the number of features. Why, then, do the closed form and LR give completely different solutions to the same system of linear equations? Am I missing something here that can explain the behavior of both approaches? For example, if the pseudoinverse is computed in different ways (SVD, QR/LQ factorization), can those produce different solutions for the same set of equations?
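For what it's worth, SciPy lets you select the LAPACK driver explicitly, so one can check whether the factorization matters (again on toy data, not my dataset):

import numpy as np
from scipy.linalg import lstsq

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([1.0, 2.0])

# gelsd/gelss are SVD-based, gelsy uses a complete orthogonal (QR-like)
# factorization; all return the minimum-norm solution for rank-deficient problems
for driver in ('gelsd', 'gelss', 'gelsy'):
    w = lstsq(X, y, lapack_driver=driver)[0]
    print(driver, w)

print(np.linalg.pinv(X).dot(y))  # agrees with all three up to floating-point noise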

Herbalist answered 27/9, 2017 at 18:11

Check out the docs of sklearn's LinearRegression again.

By default (which is how you are calling it), it also fits an intercept term!

Demo:

import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
from sklearn.linear_model import LinearRegression

X, y = load_boston(return_X_y=True)

""" OLS custom: minimum-norm least squares via the pseudoinverse, no intercept """
w = np.dot(np.linalg.pinv(X), y)
print('custom')
print(w)

""" sklearn's LinearRegression (default: fits an intercept) """
clf = LinearRegression()
print('sklearn default')
print(clf.fit(X, y).coef_)


""" sklearn's LinearRegression (no intercept-fitting) """
print('sklearn fit_intercept=False')
clf = LinearRegression(fit_intercept=False)
print(clf.fit(X, y).coef_)

Output:

custom
[ -9.16297843e-02   4.86751203e-02  -3.77930006e-03   2.85636751e+00
  -2.88077933e+00   5.92521432e+00  -7.22447929e-03  -9.67995240e-01
   1.70443393e-01  -9.38925373e-03  -3.92425680e-01   1.49832102e-02
  -4.16972624e-01]
sklearn default
[ -1.07170557e-01   4.63952195e-02   2.08602395e-02   2.68856140e+00
  -1.77957587e+01   3.80475246e+00   7.51061703e-04  -1.47575880e+00
   3.05655038e-01  -1.23293463e-02  -9.53463555e-01   9.39251272e-03
  -5.25466633e-01]
sklearn fit_intercept=False
[ -9.16297843e-02   4.86751203e-02  -3.77930006e-03   2.85636751e+00
  -2.88077933e+00   5.92521432e+00  -7.22447929e-03  -9.67995240e-01
   1.70443393e-01  -9.38925373e-03  -3.92425680e-01   1.49832102e-02
  -4.16972624e-01]
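If you want the closed form to reproduce the default (intercept) fit as well, one way (a minimal sketch, continuing the demo above) is to append a column of ones to X; the extra coefficient then plays the role of the intercept:

# Append a bias column of ones so the pseudoinverse also fits an intercept
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
wb = np.dot(np.linalg.pinv(Xb), y)
print(wb[:-1])  # matches the 'sklearn default' coefficients above
print(wb[-1])   # matches the default fit's intercept_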
Demurrer answered 28/9, 2017 at 17:25
Thanks a lot. I was thinking that the normal equation also has a $W_0$ (bias) value in its weight vector, which would be the same as the intercept value in linear regression. I know it's a naive question, but when should one use fit_intercept and when not? In the two cases the weights are very different, and I am trying to interpret the coefficients. I saw in the source that the coefficients are scaled, but what is the purpose? Herbalist
The intercept just increases the model's capability and is usually a good idea. I'm not sure what you mean by scaling. Normalization is off by default, but normalization is one of the most important things for nearly all ML algorithms, especially for linear regression; see the sketch below. That's basic ML material that every course will explain. Demurrer
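To illustrate that last comment (a minimal sketch, not part of the original answer; StandardScaler in a pipeline is just one common way to normalize):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [1.0, 100.0, 0.01]  # features on wildly different scales
y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)

# Standardize each feature to zero mean / unit variance before fitting,
# so the fitted coefficients are comparable across features
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
print(model.named_steps['linearregression'].coef_)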
