Let's take a step back and consider the following sentence you said:
since just setting fit_intercept True/False gives different result for Linear Regression
That is not entirely true. It may or may not be different; it depends entirely on your data. It would help to understand what goes into the calculation of the regression weights. I mean this somewhat literally: what does your input (x) data look like?
Understanding your input data, and understanding why it matters, will help you realize why you sometimes get different results, and why at other times the results are the same.
Data setup
Let's set up some test data:
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(1243)
x = np.random.randint(0,100,size=10)
y = np.random.randint(0,100,size=10)
Our x and y variables look like this:
 x   y
51  29
 3  73
 7  77
98  29
29  80
90  37
49   9
42  53
 8  17
65  35
No-intercept model
Recall that the calculation of the regression weights has a closed-form solution, which we can obtain via the normal equations:

$$w = (X^TX)^{-1}X^Ty$$

Using this method (with the pseudoinverse, np.linalg.pinv, standing in for the plain inverse), we get a single regression coefficient because we only have 1 predictor variable:
x = x.reshape(-1,1)   # design matrix: a single column of x values, no intercept column
w = np.dot(x.T, x)    # X^T X
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))   # (X^T X)^+ X^T y
print(w1)
[ 0.53297593]
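As a quick sanity check (a small sketch using np.linalg.lstsq, which solves the same least-squares problem directly; it is not part of the derivation above), we can confirm this value:

w_check, *_ = np.linalg.lstsq(x, y, rcond=None)   # least squares on the same no-intercept matrix
print(w_check)   # should print the same [ 0.53297593]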
Now, let's look at scikit-learn when we set fit_intercept = False:
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 0.53297593]
What happens when we set fit_intercept = True instead?
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[-0.35535884]
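Note that scikit-learn stores the fitted intercept in a separate intercept_ attribute rather than in coef_. Inspecting it shows that an intercept was indeed estimated alongside that coefficient (it should match the intercept we derive by hand in the next section):

clf = LinearRegression(fit_intercept=True)
clf.fit(x, y)
print(clf.intercept_)   # the intercept lives here, not in coef_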
It would seem that setting fit_intercept to True and False gives different answers, and that the "correct" answer occurs only when we set it to False, but this is not entirely correct...
Intercept model
At this point we have to consider what our input data actually is. In the models above, our data matrix (also called a feature matrix, or a design matrix in statistics) is just a single vector containing our x values. The y variable is not included in the design matrix. If we want to add an intercept to our model, one common approach is to add a column of 1's to the design matrix, so x becomes:
x_vals = x.flatten()    # recover the original 1-D x values
x = np.zeros((10, 2))   # new design matrix: 10 rows, 2 columns
x[:,0] = 1              # first column: the intercept term (all 1's)
x[:,1] = x_vals         # second column: the x values
intercept x
0 1.0 51.0
1 1.0 3.0
2 1.0 7.0
3 1.0 98.0
4 1.0 29.0
5 1.0 90.0
6 1.0 49.0
7 1.0 42.0
8 1.0 8.0
9 1.0 65.0
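(As an aside, a more idiomatic one-liner that builds the same matrix is np.column_stack; this is just an equivalent construction, and nothing below depends on it:)

x_alt = np.column_stack((np.ones(len(x_vals)), x_vals))   # identical to the x built column by column above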
Now, when we use this as our design matrix, we can try the closed form solution again:
w = np.dot(x.T, x)    # X^T X, now a 2x2 matrix
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))   # (X^T X)^+ X^T y
print(w1)
[ 59.60686058 -0.35535884]
Notice 2 things:
- We now have 2 coefficients. The first is our intercept and the second is the regression coefficient for the x predictor variable.
- The coefficient for x matches the coefficient from the scikit-learn output above when we set fit_intercept = True.
So in the scikit-learn models above, why was there a difference between True and False? Because in one case no intercept was modeled, while in the other case the underlying model included an intercept. This is confirmed when you manually add an intercept term/column while solving the normal equations.
If you were to use this new design matrix in scikit-learn, it doesn't matter whether you set fit_intercept to True or False: the coefficient for the predictor variable will not change (the intercept value will be different due to centering, but that's irrelevant for this discussion):
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 59.60686058 -0.35535884]
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[ 0. -0.35535884]
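You can see where that intercept went by inspecting intercept_ in each case (a small sketch continuing from the code above; the value with fit_intercept=True should match the 59.60686058 we derived by hand):

clf_false = LinearRegression(fit_intercept=False).fit(x, y)
clf_true = LinearRegression(fit_intercept=True).fit(x, y)
print(clf_false.intercept_)   # 0.0 -- the 1's column carries the intercept inside coef_
print(clf_true.intercept_)    # the same intercept, now stored separately due to centering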
Summing up
The output (i.e. the coefficient values) you get will be entirely dependent on the matrix that you feed into these calculations (whether it's the normal equations, scikit-learn, or anything else).
How does it differ to use fit_intercept in a model, and when should one set it to True/False?
If your design matrix does not contain a column of 1's, then the normal equations and scikit-learn with fit_intercept = False will give you the same answer (as you noted). However, if you set the parameter to True, the answer you get will actually be the same as the normal equations calculated with a 1's column.
When should you set True/False? As the name suggests, you set False when you don't want to include an intercept in your model. You set True when you do want an intercept, with the understanding that the coefficient values will change, but will match the normal-equations approach when your data includes a 1's column.
So True/False doesn't actually give you different results (compared to the normal equations) when considering the same underlying model. The difference you observe is because you're looking at two different statistical models (one with an intercept term, and one without). The reason the fit_intercept parameter exists is so you can create an intercept model without the hassle of manually adding that 1's column. It effectively allows you to toggle between the two underlying statistical models.
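To make that toggle concrete, here is a final check (a small sketch reusing the variables defined above): the intercept model solved via the normal equations and scikit-learn with fit_intercept=True on the original single-column data produce identical predictions:

# predictions from the normal-equations intercept model: w1 = [intercept, slope]
pred_normal = w1[0] + w1[1] * x_vals

# predictions from scikit-learn fit on the original single-column data
clf = LinearRegression(fit_intercept=True).fit(x_vals.reshape(-1, 1), y)
pred_sklearn = clf.predict(x_vals.reshape(-1, 1))

print(np.allclose(pred_normal, pred_sklearn))   # True -- the same underlying model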