How to get R-squared for robust regression (RLM) in Statsmodels?
Asked Answered
N

3

16

When it comes to measuring goodness of fit - R-Squared seems to be a commonly understood (and accepted) measure for "simple" linear models. But in case of statsmodels (as well as other statistical software) RLM does not include R-squared together with regression results. Is there a way to get it calculated "manually", perhaps in a way similar to how it is done in Stata?

Or is there another measure that can be used / calculated from the results produced by sm.RLS?

This is what Statsmodels is producing:

import numpy as np
import statsmodels.api as sm

# Sample Data with outliers
nsample = 50
x = np.linspace(0, 20, nsample)
x = sm.add_constant(x)
sig = 0.3
beta = [5, 0.5]
y_true = np.dot(x, beta)
y = y_true + sig * 1. * np.random.normal(size=nsample)
y[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

# Regression with Robust Linear Model
res = sm.RLM(y, x).fit()
print(res.summary())

Which outputs:

                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   50
Model:                            RLM   Df Residuals:                       48
Method:                          IRLS   Df Model:                            1
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                 Mo, 27 Jul 2015                                         
Time:                        10:00:00                                         
No. Iterations:                    17                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.0254      0.091     55.017      0.000         4.846     5.204
x1             0.4845      0.008     61.555      0.000         0.469     0.500
==============================================================================
Nought answered 27/7, 2015 at 14:0 Comment(6)
Since RLM is estimated with iterative weighted least squares, you could try to replicate the WLS instance wls_results = WLS(mod.endog, mod.exog, weights=mod.weights).fit() where mod is the RLM model after fit. No guarantees for this. The rsquared of a WLS results has the rsquared for the weighted residuals which would be the measure that downweights the outliers. However, I don't think you can compare models by rsquared if they differ by the weights.Partizan
The proper answer would be here github.com/statsmodels/statsmodels/pull/1341 which includes rsquared based on the definition in SAS.Partizan
Thanks, it did help to get R2=0.948 with mod = sm.RLS(y, x); r2_wls = sm.WLS(mod.endog, mod.exog, weights=mod.fit().weights).fit().rsquared. Compare to R2 of OLS=0.731. Looks like "too good to be true" :-)Nought
Thank for the link - did not see it when searched github for a similar issue. The function that is in the patch there is producing R2=0.721. Slightly lower than R2 of OLS... But BIC went down from 181 to 177 (is it a significant shift?). Is there other measure to prove that RLS clearly and numerically shows "better fit"?Nought
I just found this also stat.ethz.ch/pipermail/r-help/2008-November/179773.html . First, the PR 1341 fixes also some bugs in robust that are not used in the current RLM, but were needed for the extensions. rsquared in 1341 is a Pseudo-rsquared based on the likelihood (or M-estimation objective function) not on residual sum of squares, and AIC of OLS is based on the normal distribution. I haven't looked at this in a while, but showing "better fit" is a bit ambiguous because RLM downweights all observations that "don't fit", and treats them as outliers.Partizan
Thank you for the link - my takeout from that discussion is that instead of R2 one could use "scale estimate of the residuals ... which determines the prediction accuracy". And it seems that scale is accessible in RLS results via .scale property. However I could not find any explanation on how to interpret this parameter and what it actually means. While searching I came across a couple of papers that might be of interest: Robust version of R-squared, Robust AICNought
B
3

Since an OLS return the R2, I would suggest regressing the actual values against the fitted values using simple linear regression. Regardless where the fitted values come from, such an approach would provide you an indication of the corresponding R2.

Banded answered 19/3, 2019 at 21:36 Comment(0)
C
2

R2 is not a good measure of goodness of fit for RLM models. The problem is that the outliers have a huge effect on the R2 value, to the point where it is completely determined by outliers. Using weighted regression afterwards is an attractive alternative, but it is better to look at the p-values, standard errors and confidence intervals of the estimated coefficients.

Comparing the OLS summary to RLM (results are slightly different to yours due to different randomization):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.726
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     127.4
Date:                Wed, 03 Nov 2021   Prob (F-statistic):           4.15e-15
Time:                        09:33:40   Log-Likelihood:                -87.455
No. Observations:                  50   AIC:                             178.9
Df Residuals:                      48   BIC:                             182.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.7071      0.396     14.425      0.000       4.912       6.503
x1             0.3848      0.034     11.288      0.000       0.316       0.453
==============================================================================
Omnibus:                       23.499   Durbin-Watson:                   2.752
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.906
Skew:                          -1.649   Prob(JB):                     4.34e-08
Kurtosis:                       5.324   Cond. No.                         23.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   50
Model:                            RLM   Df Residuals:                       48
Method:                          IRLS   Df Model:                            1
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Wed, 03 Nov 2021                                         
Time:                        09:34:24                                         
No. Iterations:                    17                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1857      0.111     46.590      0.000       4.968       5.404
x1             0.4790      0.010     49.947      0.000       0.460       0.498
==============================================================================

If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .

You can see that the standard errors and size of the confidence interval decreases in going from OLS to RLM for both the intercept and the slope term. This suggests that the estimates are closer to the real values.

Crossroad answered 3/11, 2021 at 9:35 Comment(0)
C
1

Why not use model.predict to obtain the r2? For Example:

r2=1. - np.sum(np.abs(model.predict(X) - y) **2) / np.sum(np.abs(y - np.mean(y)) ** 2)
Costmary answered 4/1, 2020 at 5:39 Comment(2)
That will be dominated by outliers.Partizan
@Partizan - Usually, I would use the WLS mechanism and compare R2 values (or study specific metric) on outsample data. Is there a better a mechanism?Costmary

© 2022 - 2024 — McMap. All rights reserved.