Robustness issue with statsmodels linear regression (OLS) - Python

I was testing some basic categorical regression using statsmodels: I built a deterministic model

Y = X + Z

where X can take 3 values (a, b or c) and Z only 2 (d or e). At this stage the model is purely deterministic; I set up the weights for each variable as follows:

a's weight=1

b's weight=2

c's weight=3

d's weight=4

e's weight=5

Therefore with 1(X=a) being 1 if X=a, 0 otherwise, the model is simply:

Y = 1(X=a) + 2*1(X=b) + 3*1(X=c) + 4*1(Z=d) + 5*1(Z=e)

The following code generates the different variables and runs the regression:

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

nbData = 100  # number of observations (100 for the first run below)
rand1 = np.random.uniform(size=nbData)
rand2 = np.random.uniform(size=nbData)

# dummy indicators for the categories of X (a, b, c) and Z (d, e)
a = 1 * (rand1 <= (1.0/3.0))
b = 1 * (((1.0/3.0) < rand1) & (rand1 < (4.0/5.0)))
c = 1 - b - a
d = 1 * (rand2 <= (3.0/5.0))
e = 1 - d

weights = [1, 2, 3, 4, 5]
y = a + 2*b + 3*c + 4*d + 5*e
df = pd.DataFrame({'y': y, 'a': a, 'b': b, 'c': c, 'd': d, 'e': e})

mod = ols(formula='y ~ a + b + c + d + e - 1', data=df)
res = mod.fit()
print(res.summary())
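
Side note: the five dummy columns are exactly collinear by construction, since a + b + c = 1 and d + e = 1, so the intercept-free design above is rank deficient. A quick check, reusing mod from the snippet above:

# (a + b + c) - (d + e) = 0 is an exact linear dependence among the five
# dummy columns, so the design matrix has rank 4 rather than 5.
print(np.linalg.matrix_rank(mod.exog))  # expected: 4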

I end up with the right results (one has to look at the differences between the coefficients rather than the coefficients themselves):

                           OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.006e+30
Date:                Wed, 16 Sep 2015   Prob (F-statistic):               0.00
Time:                        03:05:40   Log-Likelihood:                 3156.8
No. Observations:                 100   AIC:                            -6306.
Df Residuals:                      96   BIC:                            -6295.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
a              1.6000   7.47e-16   2.14e+15      0.000         1.600     1.600
b              2.6000   6.11e-16   4.25e+15      0.000         2.600     2.600
c              3.6000   9.61e-16   3.74e+15      0.000         3.600     3.600
d              3.4000   5.21e-16   6.52e+15      0.000         3.400     3.400
e              4.4000   6.85e-16   6.42e+15      0.000         4.400     4.400
==============================================================================
Omnibus:                       11.299   Durbin-Watson:                   0.833
Prob(Omnibus):                  0.004   Jarque-Bera (JB):                5.720
Skew:                          -0.381   Prob(JB):                       0.0573
Kurtosis:                       2.110   Cond. No.                     2.46e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.67e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

But when I increase the number of data points to (say) 600, the regression produces really bad results. I have tried a similar regression in Excel and in R, and they produce consistent results whatever the number of data points. Does anyone know if there is some restriction in statsmodels OLS that explains this behaviour, or am I missing something?

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.167
Model:                            OLS   Adj. R-squared:                  0.161
Method:                 Least Squares   F-statistic:                     29.83
Date:                Wed, 16 Sep 2015   Prob (F-statistic):           1.23e-22
Time:                        03:08:04   Log-Likelihood:                -701.02
No. Observations:                 600   AIC:                             1412.
Df Residuals:                     595   BIC:                             1434.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
a              5.8070   1.15e+13   5.05e-13      1.000     -2.26e+13  2.26e+13
b              6.4951   1.15e+13   5.65e-13      1.000     -2.26e+13  2.26e+13
c              6.9033   1.15e+13   6.01e-13      1.000     -2.26e+13  2.26e+13
d             -1.1927   1.15e+13  -1.04e-13      1.000     -2.26e+13  2.26e+13
e             -0.1685   1.15e+13  -1.47e-14      1.000     -2.26e+13  2.26e+13
==============================================================================
Omnibus:                       67.153   Durbin-Watson:                   0.328
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               70.964
Skew:                           0.791   Prob(JB):                     3.89e-16
Kurtosis:                       2.419   Cond. No.                     7.70e+14
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.25e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Floater asked 16/9, 2015 at 3:13. Comments (5):
Can you post the Excel or R code too? My expectation is that the multicollinearity problem should occur: if you have a variable that can take on N distinct categorical values, it should be represented with N-1 dummy columns, not N, because given the first N-1 columns the Nth column is fully determined. For example, your variable Z is fully determined by a single dummy column. I'm surprised that other regression tools are giving you consistent results. - Nevile
Nice example, I've never seen a case like this. My guess is that you are just hitting the threshold for the generalized inverse, numpy.pinv. In the first case the solution is reduced rank, with df_model equal to 3. In the second case there is enough numerical noise that the matrix is identified as full rank, with df_model equal to 4. This might be a case in favor of raising the pinv threshold from numpy's default value. R might produce more stable, although equally unidentified, results because it uses a pivoting QR decomposition by default, AFAIK. Stata drops perfectly collinear columns. - Dilan
It's still a bit surprising. What is the dtype of your data, i.e. mod.exog.dtype? - Dilan
The dtype is float64 for me with a failing example. The threshold for pinv is just a bit too low: np.linalg.pinv(mod.exog, rcond=1.5e-15).dot(mod.endog) instead of the default 1e-15 would fix it for the random case I got (see the sketch after these comments). But as the warning indicates, and as Mr. F pointed out, it is fragile to rely on precision issues in singular or near-singular cases. - Dilan
@Mr. F: For Excel I simply use LINEST, and for R the code is simply myreg <- lm(y ~ a + b + c + d + e, data = data). In R, it recognizes that there is collinearity and drops the Nth dummy variable (the R output for the regression slope parameters is a=2, b=1, c=NA, d=1, e=NA). - Floater
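
Below is a minimal sketch of the pinv workaround mentioned in the comments above; it only illustrates the numerical sensitivity and, as noted, relying on the cutoff in near-singular cases is fragile rather than a real fix:

import numpy as np

# least-squares solution via the generalized inverse, with the singular-value
# cutoff (rcond) raised slightly above numpy's default of 1e-15
params = np.linalg.pinv(mod.exog, rcond=1.5e-15).dot(mod.endog)
print(params)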

It appears that, as mentioned by Mr. F, the main problem is that statsmodels OLS does not handle the collinearity problem as well as Excel/R in this case. But if, instead of defining one dummy variable for each of a, b, c, d and e, one defines a variable X that can be equal to a, b or c and a variable Z that can be equal to d or e, then the regression works fine, i.e., updating the code with:

# recode the dummies into two categorical columns X and Z
# (an equivalent, more compact np.select recoding is sketched after the results below)
df['X'] = 'c'
df.loc[df.b != 0, 'X'] = 'b'
df.loc[df.a != 0, 'X'] = 'a'
df['Z'] = 'e'
df.loc[df.d != 0, 'Z'] = 'd'

mod = ols(formula='y ~ X + Z - 1', data=df)

leads to the expected results:

                           OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.684e+27
Date:                Thu, 17 Sep 2015   Prob (F-statistic):               0.00
Time:                        06:22:43   Log-Likelihood:             2.5096e+06
No. Observations:              100000   AIC:                        -5.019e+06
Df Residuals:                   99996   BIC:                        -5.019e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
X[a]           5.0000   1.85e-14    2.7e+14      0.000         5.000     5.000
X[b]           6.0000   1.62e-14   3.71e+14      0.000         6.000     6.000
X[c]           7.0000   2.31e-14   3.04e+14      0.000         7.000     7.000
Z[T.e]         1.0000   1.97e-14   5.08e+13      0.000         1.000     1.000
==============================================================================
Omnibus:                      145.367   Durbin-Watson:                   1.353
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             9729.487
Skew:                          -0.094   Prob(JB):                         0.00
Kurtosis:                       1.483   Cond. No.                         2.29
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
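
As a side note, the same recoding can be written more compactly with np.select; the sketch below (assuming numpy is imported as np) is equivalent to the loc-style assignments above:

import numpy as np

# pick the category label from whichever dummy is set; the default covers the rest
df['X'] = np.select([df.a != 0, df.b != 0], ['a', 'b'], default='c')
df['Z'] = np.select([df.d != 0], ['d'], default='e')

res = ols(formula='y ~ X + Z - 1', data=df).fit()
print(res.summary())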
Floater answered 17/9, 2015 at 6:26
