Quantile Regression in Python gives different results than in R

QuantReg from the statsmodels package in Python gives very different results than quantreg in R, using the data shown in the following code.

I tried the STACKLOSS data in Python and R respectively, and the results were the same. I wonder whether the data itself causes some issue in Python, or whether there is some fundamental difference between the two implementations of the algorithm, but I couldn't figure it out.

Code in Python:

from statsmodels.regression.quantile_regression import QuantReg
y = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 662.59, 248.08, 331.25, 182.98, 1085.69, -44.32]
X = [
    [1, 20322.18, 0.00, 0], [1, 19653.34, 0.00, 0],
    [1, 0.00, 72712.41, 0], [1, 0.00, 72407.31, 0],
    [1, 0.00, 72407.31, 0], [1, 0.00, 72201.89, 9111],
    [1, 183.52, 0.00, 0], [1, 183.52, 0.00, 0],
    [1, 0.00, 0.00, 2879], [1, 0.00, 0.00, 2698],
    [1, 0.00, 0.00, 0], [1, 0.00, 0.00, 0],
    [1, 0.00, 0.00, 19358], [1, 0.00, 0.00, 19001]
]

print(QuantReg(y, X).fit(q=.5).summary())

and in R:

library(quantreg)

y <- c(0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 662.59, 248.08, 331.25, 182.98, 1085.69, -44.32)
X <- matrix(
    c(1, 20322.18, 0.00, 0, 1, 19653.34, 0.00, 0,
      1, 0.00, 72712.41, 0, 1, 0.00, 72407.31, 0,
      1, 0.00, 72407.31, 0, 1, 0.00, 72201.89, 9111,
      1, 183.52, 0.00, 0, 1, 183.52, 0.00, 0,
      1, 0.00, 0.00, 2879, 1, 0.00, 0.00, 2698,
      1, 0.00, 0.00, 0, 1, 0.00, 0.00, 0,
      1, 0.00, 0.00, 19358, 1, 0.00, 0.00, 19001),
    nrow=14, ncol=4, byrow=TRUE
)

rq(y~.-1, data=data.frame(X), tau=.5, method='fn')

R gives the coefficients 1.829800e+02, -9.003955e-03, -2.527093e-03, and -5.697678e-05,

while Python gives 3.339e-05, -1.671e-09, -4.635e-10, and 7.957e-11.

Any input or hint is appreciated.

Mcglothlin answered 2/5, 2019 at 14:20 Comment(2)
How large are the standard errors in both packages?Tetracaine
The standard deviations for the 4 coefficients are 161.702, 0.016, 0.003, and 0.016 for the Python version. I only see upper and lower bounds for the R version in the summary, not the standard deviations.Mcglothlin

I guess this is a data problem: the parameters are not well identified. More than half of the observations have a response value of zero, while the other values are much larger.

As far as I know, the optimization algorithm differs between R and statsmodels, especially in the treatment of observations that have residuals close to zero.

If the parameters are not well identified, that is, if the data does not provide enough information in the relevant range, then small differences in the implementation and optimization algorithm can have large effects on the parameter estimates.

This most likely means that no estimator can provide precise parameter estimates in this case.
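A quick way to see this in the data is to check how many responses are exactly zero and how badly scaled the design matrix is. This is a sketch of my own, not part of the original answer, and it reuses the y and X lists from the question's Python code:

import numpy as np

Xa = np.asarray(X, dtype=float)
ya = np.asarray(y, dtype=float)

print((ya == 0).mean())    # fraction of zero responses (more than half here)
print(np.linalg.cond(Xa))  # very large condition number: badly scaled columns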

Tetracaine answered 2/5, 2019 at 14:39 Comment(6)
Ok, I see. Is there any workaround that can help me get Python results similar to R? For my specific use case, the R results made much more sense for predictions.Mcglothlin
statsmodels does not have a choice for the optimization algorithm in quantile regression. So, there is no option to replicate the R results for this kind of data.Tetracaine
more general question: Why do you use quantile regression for this data set? Maybe it's not the appropriate model and estimator for this dataset.Tetracaine
Rescaling the variables might also help a bit. The variable in the third column works like a dummy variable, with points at 0 and around 72000.Tetracaine
For comparisons mainly, as this model showed good results previously. I was thus hoping to migrate it to Python and do an apples-to-apples comparison with the rest. BTW, the statsmodels summary shows the lower and upper bounds of a 95% CI, but I'm not sure whether the R summary defines its intervals the same way? Any pointer to the documentation would be very helpful, thank you!Mcglothlin
Sorry, I don't know how R computes the intervals. Inference for quantile regression in statsmodels is designed after those in Stata and I never checked the details for it in R. The intervals in statsmodels are standard Wald intervals based on standard errors as in other models. The computation of standard errors is specific to quantile regression.Tetracaine

The optimization algorithms in R and Python are quite different. QuantReg in Python estimates a quantile regression model using iteratively reweighted least squares, while the R package quantreg uses the interior-point method, the simplex method, and a smoothing method to solve the optimization problem.

The results can therefore differ, but they are usually close to each other. Maybe your data is not well suited to the model or to a particular optimization algorithm.
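One rough way to compare two such fits, since both solvers target the same objective, is to evaluate the check (pinball) loss for each coefficient vector. This sketch is my own illustration, not part of the original answer, and reuses the y, X, and coefficients reported in the question:

import numpy as np

def check_loss(y, X, beta, q=0.5):
    # quantile (pinball) loss summed over observations for residuals y - X @ beta
    resid = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ np.asarray(beta)
    return np.sum(np.where(resid >= 0, q * resid, (q - 1.0) * resid))

beta_r = [1.829800e+02, -9.003955e-03, -2.527093e-03, -5.697678e-05]  # from R's rq
beta_py = [3.339e-05, -1.671e-09, -4.635e-10, 7.957e-11]              # from statsmodels
print(check_loss(y, X, beta_r), check_loss(y, X, beta_py))

Whichever vector gives the smaller loss is the better solution to the same optimization problem.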

Syringe answered 2/11, 2019 at 2:53 Comment(0)

I noticed the same thing. For me it simply seemed to be a numerical/scaling issue. For both Python and R I transformed all values to z-scores and afterward both sets of betas were near-identical, although the SEs were still different. In the Python version I also had a warning "The condition number is large, 5.66e+06. This might indicate that there are strong multicollinearity or other numerical problems."
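A minimal sketch of one way to do this on the Python side (my own adaptation of the question's code, not part of the original answer); only the non-constant columns are z-scored, since the first column of X is the intercept and has zero variance:

import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

Xa = np.asarray(X, dtype=float)
Z = Xa.copy()
# standardize the predictor columns, leaving the intercept column untouched
Z[:, 1:] = (Xa[:, 1:] - Xa[:, 1:].mean(axis=0)) / Xa[:, 1:].std(axis=0)

res = QuantReg(y, Z).fit(q=0.5)
print(res.params)  # coefficients on the standardized scale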

I realize this question is almost 2 years old now, but I don't think any of the other answers mentioned this, so hopefully this helps any new readers.

Cymar answered 12/4, 2021 at 6:22 Comment(0)

You have probably figured it out by now, but you need to add the constant yourself when using QuantReg in Python. Once you use sm.add_constant you should get the same results.
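A short sketch of that suggestion (my own illustration using the question's data, not code from the original answer); note that the X in the question already contains a column of ones, so the predictors are taken without it first:

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

X_raw = np.asarray(X, dtype=float)[:, 1:]  # predictors without the existing ones column
X_const = sm.add_constant(X_raw)           # prepend an intercept column
print(QuantReg(y, X_const).fit(q=0.5).params)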

Leslileslie answered 5/11, 2020 at 16:11 Comment(0)
