Quantile Regression in Python gives different results than in R

QuantReg from the statsmodels package in Python gives very different results than quantreg in R, using the data shown in the following code.

I tried the STACKLOSS data in Python and R respectively, and the results were the same. I wonder whether the data itself causes some issue in Python, or whether there is some fundamental difference between the two implementations of the algorithm, but I couldn't figure it out.

Code in Python:

from statsmodels.regression.quantile_regression import QuantReg
y = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 662.59, 248.08, 331.25, 182.98, 1085.69, -44.32]
X = [
    [1, 20322.18, 0.00, 0], [1, 19653.34, 0.00, 0],
    [1, 0.00, 72712.41, 0], [1, 0.00, 72407.31, 0],
    [1, 0.00, 72407.31, 0], [1, 0.00, 72201.89, 9111],
    [1, 183.52, 0.00, 0], [1, 183.52, 0.00, 0],
    [1, 0.00, 0.00, 2879], [1, 0.00, 0.00, 2698],
    [1, 0.00, 0.00, 0], [1, 0.00, 0.00, 0],
    [1, 0.00, 0.00, 19358], [1, 0.00, 0.00, 19001]
]

print(QuantReg(y, X).fit(q=.5).summary())

and in R:

library(quantreg)

y <- c(0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 662.59, 248.08, 331.25, 182.98, 1085.69, -44.32)
X <- matrix(
    c(1, 20322.18, 0.00, 0, 1, 19653.34, 0.00, 0,
      1, 0.00, 72712.41, 0, 1, 0.00, 72407.31, 0,
      1, 0.00, 72407.31, 0, 1, 0.00, 72201.89, 9111,
      1, 183.52, 0.00, 0, 1, 183.52, 0.00, 0,
      1, 0.00, 0.00, 2879, 1, 0.00, 0.00, 2698,
      1, 0.00, 0.00, 0, 1, 0.00, 0.00, 0,
      1, 0.00, 0.00, 19358, 1, 0.00, 0.00, 19001),
    nrow=14, ncol=4, byrow=TRUE
)

rq(y~.-1, data=data.frame(X), tau=.5, method='fn')

R gives the coefficients 1.829800e+02, -9.003955e-03, -2.527093e-03, and -5.697678e-05,

while Python gives 3.339e-05, -1.671e-09, -4.635e-10, and 7.957e-11.

Any input or hint is appreciated.

Mcglothlin answered 2/5, 2019 at 14:20 Comment(2)
How large are the standard errors in both packages?Tetracaine
The standard deviations for the 4 coefficients are 161.702, 0.016, 0.003, and 0.016 for the Python version. I only see upper and lower bounds for the R version in the summary, not the standard deviations.Mcglothlin

I guess this is a data problem: the parameters are not well identified. More than half of the observations have a response value of zero, while the other values are much larger.

As far as I know, the optimization algorithm differs between R and statsmodels, especially in the treatment of observations that have residuals close to zero.

If the parameters are not well identified, that is, if the data does not provide enough information in the relevant range, then small differences in the implementation and optimization algorithm can have large effects on the parameter estimates.

This most likely means that no estimator can provide precise parameter estimates in this case.
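A quick way to see this in the data is to check how many responses are exactly zero and how badly scaled the design matrix is. This is a sketch of my own, not part of the original answer, and it reuses the y and X lists from the question's Python code:

import numpy as np

Xa = np.asarray(X, dtype=float)
ya = np.asarray(y, dtype=float)

print((ya == 0).mean())    # fraction of zero responses (more than half here)
print(np.linalg.cond(Xa))  # very large condition number: badly scaled columns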

Tetracaine answered 2/5, 2019 at 14:39 Comment(6)
Ok, I see. Is there any workaround that can help me get Python results similar to R? For my specific use case, the R results made much more sense for predictions.Mcglothlin
statsmodels does not have a choice for the optimization algorithm in quantile regression. So, there is no option to replicate the R results for this kind of data.Tetracaine
more general question: Why do you use quantile regression for this data set? Maybe it's not the appropriate model and estimator for this dataset.Tetracaine
Rescaling the variables might also help a bit. The variable in the third column works like a dummy variable, with points at 0 and around 72000.Tetracaine
For comparisons mainly, as this model showed good results previously. I was thus hoping to migrate it to Python and do an apples-to-apples comparison with the rest. BTW, the statsmodels summary shows the lower and upper bounds of a 95% CI, but I'm not sure whether the R summary defines its intervals the same way? Any pointer to the documentation would be very helpful, thank you!Mcglothlin
Sorry, I don't know how R computes the intervals. Inference for quantile regression in statsmodels is designed after those in Stata and I never checked the details for it in R. The intervals in statsmodels are standard Wald intervals based on standard errors as in other models. The computation of standard errors is specific to quantile regression.Tetracaine

The optimization algorithms in R and Python are quite different. QuantReg in Python estimates a quantile regression model using iteratively reweighted least squares, while the R package quantreg uses the interior-point method, the simplex method, and a smoothing method to solve the optimization problem.

The results can therefore differ, but they are usually close to each other. Maybe your data is not well suited to the model or to a particular optimization algorithm.
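One rough way to compare two such fits, since both solvers target the same objective, is to evaluate the check (pinball) loss for each coefficient vector. This sketch is my own illustration, not part of the original answer, and reuses the y, X, and coefficients reported in the question:

import numpy as np

def check_loss(y, X, beta, q=0.5):
    # quantile (pinball) loss summed over observations for residuals y - X @ beta
    resid = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ np.asarray(beta)
    return np.sum(np.where(resid >= 0, q * resid, (q - 1.0) * resid))

beta_r = [1.829800e+02, -9.003955e-03, -2.527093e-03, -5.697678e-05]  # from R's rq
beta_py = [3.339e-05, -1.671e-09, -4.635e-10, 7.957e-11]              # from statsmodels
print(check_loss(y, X, beta_r), check_loss(y, X, beta_py))

Whichever vector gives the smaller loss is the better solution to the same optimization problem.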

Syringe answered 2/11, 2019 at 2:53 Comment(0)

I noticed the same thing. For me it simply seemed to be a numerical/scaling issue. For both Python and R I transformed all values to z-scores and afterward both sets of betas were near-identical, although the SEs were still different. In the Python version I also had a warning "The condition number is large, 5.66e+06. This might indicate that there are strong multicollinearity or other numerical problems."
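A minimal sketch of one way to do this on the Python side (my own adaptation of the question's code, not part of the original answer); only the non-constant columns are z-scored, since the first column of X is the intercept and has zero variance:

import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

Xa = np.asarray(X, dtype=float)
Z = Xa.copy()
# standardize the predictor columns, leaving the intercept column untouched
Z[:, 1:] = (Xa[:, 1:] - Xa[:, 1:].mean(axis=0)) / Xa[:, 1:].std(axis=0)

res = QuantReg(y, Z).fit(q=0.5)
print(res.params)  # coefficients on the standardized scale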

I realize this question is almost 2 years old now, but I don't think any of the other answers mentioned this, so hopefully this helps any new readers.

Cymar answered 12/4, 2021 at 6:22 Comment(0)

You have probably figured it out by now, but you need to add the constant yourself when using QuantReg in Python. Once you use sm.add_constant you should get the same results.
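A short sketch of that suggestion (my own illustration using the question's data, not code from the original answer); note that the X in the question already contains a column of ones, so the predictors are taken without it first:

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

X_raw = np.asarray(X, dtype=float)[:, 1:]  # predictors without the existing ones column
X_const = sm.add_constant(X_raw)           # prepend an intercept column
print(QuantReg(y, X_const).fit(q=0.5).params)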

Leslileslie answered 5/11, 2020 at 16:11 Comment(0)
