Differences in Linear Regression in R and Python [closed]

I was trying to match the linear regression results from R with those from Python, specifically the coefficient of each independent variable. The code is below.

The data files are uploaded here:

https://www.dropbox.com/s/oowe4irm9332s78/X.csv?dl=0

https://www.dropbox.com/s/79scp54unzlbwyk/Y.csv?dl=0

R code:

pathname <- ""  # set this to the folder that contains the CSV files

X <- read.csv(file.path(pathname,"X.csv"),stringsAsFactors = F)
Y <- read.csv(file.path(pathname,"Y.csv"),stringsAsFactors = F)

d1 = Y$d1

reg_data <- cbind(d1, X)
head(reg_data)

reg_model <- lm(d1 ~ .,data=reg_data)

names(which(is.na(reg_model$coefficients)))

reg_model$coefficients

R Result

> summary(reg_model)

Call:
lm(formula = d1 ~ ., data = reg_data)

Residuals:
ALL 60 residuals are 0: no residual degrees of freedom!

Coefficients: (14 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.37752         NA      NA       NA
v1            1.30214         NA      NA       NA
v2           -2.93118         NA      NA       NA
v3            7.58902         NA      NA       NA
v4           11.88570         NA      NA       NA
v5            1.60622         NA      NA       NA
v6            3.71528         NA      NA       NA
v7           -9.34627         NA      NA       NA
v8           -3.84694         NA      NA       NA
v9           -2.51332         NA      NA       NA
v10           4.22403         NA      NA       NA
v11          -9.70126         NA      NA       NA
v12                NA         NA      NA       NA
v13           4.67276         NA      NA       NA
v14          -6.57924         NA      NA       NA
v15          -3.68065         NA      NA       NA
v16           5.25168         NA      NA       NA
v17          14.60444         NA      NA       NA
v18          16.00679         NA      NA       NA
v19          24.79622         NA      NA       NA
v20          13.85774         NA      NA       NA
v21           2.16022         NA      NA       NA
v22         -36.65361         NA      NA       NA
v23           2.26554         NA      NA       NA
v24                NA         NA      NA       NA
v25                NA         NA      NA       NA
v26           7.00981         NA      NA       NA
v27           0.88904         NA      NA       NA
v28           0.34400         NA      NA       NA
v29          -5.27597         NA      NA       NA
v30           5.21034         NA      NA       NA
v31           6.79640         NA      NA       NA
v32           2.96346         NA      NA       NA
v33          -1.52702         NA      NA       NA
v34          -2.74632         NA      NA       NA
v35          -2.36952         NA      NA       NA
v36          -7.76547         NA      NA       NA
v37           2.19630         NA      NA       NA
v38           1.63336         NA      NA       NA
v39           0.69485         NA      NA       NA
v40           0.37379         NA      NA       NA
v41          -0.09107         NA      NA       NA
v42           2.06569         NA      NA       NA
v43           1.57505         NA      NA       NA
v44           2.70535         NA      NA       NA
v45           1.17634         NA      NA       NA
v46         -10.51141         NA      NA       NA
v47          -1.15060         NA      NA       NA
v48           2.87353         NA      NA       NA
v49           3.37740         NA      NA       NA
v50          -5.89816         NA      NA       NA
v51           0.85851         NA      NA       NA
v52           3.73929         NA      NA       NA
v53           4.93265         NA      NA       NA
v54           3.45650         NA      NA       NA
v55           0.12382         NA      NA       NA
v56          -0.21171         NA      NA       NA
v57           4.37199         NA      NA       NA
v58           3.21456         NA      NA       NA
v59           0.09012         NA      NA       NA
v60          -0.85414         NA      NA       NA
v61          -3.29856         NA      NA       NA
v62           4.38842         NA      NA       NA
v63                NA         NA      NA       NA
v64                NA         NA      NA       NA
v65                NA         NA      NA       NA
v66                NA         NA      NA       NA
v67                NA         NA      NA       NA
v68                NA         NA      NA       NA
v69                NA         NA      NA       NA
v70                NA         NA      NA       NA
v71                NA         NA      NA       NA
v72                NA         NA      NA       NA
v73                NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 59 and 0 DF,  p-value: NA

Python Code:

import pandas as pd
from sklearn.linear_model import LinearRegression

Y = pd.read_csv(pathname + "Y.csv")
X = pd.read_csv(pathname + "X.csv")

lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lr.fit(X, Y['d1'])

(list(zip(lr.coef_, X)))

lr.intercept_

Python Result:

intercept = 29.396033164254106

[(-2.4463986167806304, 'v1'),
 (-1.6293010275307021, 'v2'),
 (0.89089949009506508, 'v3'),
 (-3.1021251646895251, 'v4'),
 (-1.7707078771936109, 'v5'),
 (-2.0474705122225636, 'v6'),
 (-1.5537181337496202, 'v7'),
 (-1.6391241229716156, 'v8'),
 (-1.2981646048517046, 'v9'),
 (0.89221826294889328, 'v10'),
 (-0.56694104645951571, 'v11'),
 (2.042810365310288e-14, 'v12'),
 (-2.0312478672439052, 'v13'),
 (-1.5617121392788413, 'v14'),
 (0.4583365939498274, 'v15'),
 (0.8840538748922967, 'v16'),
 (-5.5952681002058871, 'v17'),
 (2.4937042448512892, 'v18'),
 (0.45806845189176543, 'v19'),
 (-1.1648810657830406, 'v20'),
 (-1.7800004329275585, 'v21'),
 (-5.0132817522704816, 'v22'),
 (3.6862778096189266, 'v23'),
 (2.7533531010703882e-14, 'v24'),
 (1.2150003225741557e-14, 'v25'),
 (0.94669823515018103, 'v26'),
 (-0.3082823207975679, 'v27'),
 (0.53619247380957358, 'v28'),
 (-1.1339902793546781, 'v29'),
 (1.9657159583080186, 'v30'),
 (-0.63200501460653324, 'v31'),
 (1.4741013580918978, 'v32'),
 (-2.4448418291953313, 'v33'),
 (-2.0787115960875036, 'v34'),
 (0.22492914212063603, 'v35'),
 (-0.75136276693004922, 'v36'),
 (1.2838658951186761, 'v37'),
 (0.5816277993227944, 'v38'),
 (-0.11270569554555088, 'v39'),
 (-0.13430982360936233, 'v40'),
 (-3.3189296496897662, 'v41'),
 (-0.452575588270415, 'v42'),
 (6.1329755709937519, 'v43'),
 (0.61559185634634817, 'v44'),
 (-1.206315459828555, 'v45'),
 (-3.7452010299772009, 'v46'),
 (-1.1472174665136678, 'v47'),
 (2.8960489381172652, 'v48'),
 (0.0090220136972478659, 'v49'),
 (-5.264918363314754, 'v50'),
 (1.2194758337662015, 'v51'),
 (2.78655271320092, 'v52'),
 (3.106513852668896, 'v53'),
 (3.5181252502607929, 'v54'),
 (-0.34426523770507278, 'v55'),
 (-0.48792823932479878, 'v56'),
 (0.12284460490031779, 'v57'),
 (1.6860388628044991, 'v58'),
 (1.2823067194737174, 'v59'),
 (2.8352263554153665, 'v60'),
 (-1.304336378501032, 'v61'),
 (0.55226132316435139, 'v62'),
 (1.5416988124754771, 'v63'),
 (-0.2605804175310813, 'v64'),
 (1.2489066081702334, 'v65'),
 (-0.44469553013696161, 'v66'),
 (-1.4102990055550157, 'v67'),
 (3.8150423259022639, 'v68'),
 (0.12039684410168072, 'v69'),
 (-1.340699466779357, 'v70'),
 (1.7066389124439212, 'v71'),
 (0.50470752944860442, 'v72'),
 (1.0024872633969766, 'v73')]

But the coefficients do not match. Please help.

Note: the coefficients did match for the example below:

http://davidcoallier.com/blog/linear-regression-from-r-to-python/

Theola answered 2/12, 2016 at 15:25 Comment(15)
Start with the basics. Read the data, compare summary statistics, make sure you are consistently regressing y on X, etc. There is not much that can go wrong in the mechanics of how to run a linear regression...Butterball
Dirk, I am aware of the process. This is mainly a conversion task: I am trying to convert an R application containing this code to Python. I need to translate it as-is and match the coefficients. That is it.Theola
Are default contrasts in R and Python the same?Burundi
Not sure, Roman. I didn't change any for the given example (revenue) in the link, at least.Theola
Please provide a self-contained minimal example including all input and output, and specifically point out what is different. Include the data right in the post, and if it is large, cut it down to the minimal size needed to still illustrate the problem. See minimal reproducible example.Suppurate
Actually, I was trying to match coefficients. Here the number of independent variables is large, which is why I posted links to the files. I have already worked on a minimal example and the coefficients matched for that. I was wondering what is wrong in this case, particularly when there are so many variables.Theola
As you say (it would have been good to point this out), you have a p>n situation: 73 predictor variables and only 61 responses. Things get tricky here, both statistically and computationally. R is pretty clever about rank-deficient fits; I don't have any idea what Python does. In any case you will probably have to work harder to get a sensible answer in this kind of situation; naive linear regression will probably give you garbage.Skirret
See statsmodels.sourceforge.net/devel/pitfalls.html: " * Rank deficient matrices will not raise an error. * Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior"Skirret
Nice work, @BenBolker. So my initial hunch to check the data was not all that far off...Butterball
If you edit your question to include these specific issues (i.e. don't make us guess!), we could probably answer - if not solve the problem. You could set up a reproducible example with multicollinearity, e.g. dd <- data.frame(y=rnorm(4),a=1:4,b=2:5,c=rnorm(4)); coef(lm(y~a+b+c,dd))Skirret
Thanks, Ben. As I said, this is an existing application in R, and depending on the data it drops variables with NA or 0 coefficients. We are trying to convert it into Python and need to reproduce the exact results, meaning the same variables must be dropped as in R. But that is not happening.Theola
I updated the results too. As you can see, there are NAs in the R results but the Python results are different.Theola
@maddykemen You fundamentally misunderstand the problem, and will very likely be unsuccessful in "matching" the Python answer to what R gives you for what is fundamentally an ill-posed problem. Use a normal data set with N > k, and get that right. What you have here is a special case dealt with deep in the bowels of R, and Python does not have that machinery.Butterball
Dirk, is there no way to handle that situation in Python? Any references would help.Theola
You will not be able to consistently generate 73 coefficients with only 60 data observations. I would question your model and not whether Python or R is providing the correct result.Mika

tl;dr if you want to replicate R's strategy in Python you're probably going to have to implement it yourself, as R does some clever stuff that's not widely available elsewhere.

For reference (since this has so far been mentioned only in comments), this is an ill-posed/rank-deficient fit, which will always happen when there are more predictor variables than observations (p>n: in this case p=73, n=61), and often when there are many categorical predictors and/or the experimental design is limited in some way. Dealing with these situations so as to get an answer that means anything at all typically requires careful thought and/or advanced techniques (e.g. penalized regression: see the references to lasso and ridge regression in the Wikipedia article on linear regression).
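
As a quick check, here is a minimal Python sketch (my own addition, assuming the X.csv/Y.csv files from the question and only numpy/pandas) that confirms the design matrix is rank deficient before worrying about coefficients:

import numpy as np
import pandas as pd

X = pd.read_csv("X.csv")
y = pd.read_csv("Y.csv")["d1"]

# Design matrix with an explicit intercept column (both lm() and
# LinearRegression(fit_intercept=True) estimate an intercept).
D = np.column_stack([np.ones(len(X)), X.values])

print(D.shape)                   # more columns than rows: p + 1 > n
print(np.linalg.matrix_rank(D))  # rank can be at most n, so the fit is rank deficient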

The most naive way to handle this situation is to throw it all into the standard linear algebra and hope that nothing breaks too badly, which is apparently what Python's statsmodels package does (a small sketch of what that looks like in practice follows the quoted list). From the pitfalls document:

  • Rank deficient matrices will not raise an error.
  • Cases of almost perfect multicollinearity or ill-conditioned design matrices might produce numerically unstable results. Users need to manually check the rank or condition number of the matrix if this is not the desired behavior.
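
To see that behaviour concretely, here is a minimal sketch (my own illustration, not from the original post) fitting the same data with statsmodels: every column gets a finite coefficient instead of R's NAs, because the default pinv (SVD-based) solver silently regularizes the rank-deficient system.

import pandas as pd
import statsmodels.api as sm

X = pd.read_csv("X.csv")
y = pd.read_csv("Y.csv")["d1"]

# OLS.fit() defaults to a pinv-based solver: it does not raise on a
# rank-deficient design, it just returns the minimum-norm solution.
res = sm.OLS(y, sm.add_constant(X)).fit()

print(res.params.head())  # finite coefficients for every column, unlike R's NAs
print(res.df_resid)       # should be 0 here, matching R's "no residual degrees of freedom"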

The next best thing (reasonable when there is a small degree of collinearity) is to pivot sensibly when doing the linear algebra, that is, rearrange the computational problem so that the collinear parts can be left out. That's what R does; in order to do so, the authors of the R code had to modify the standard LINPACK routines.

The underlying code has this comment/explanation:

c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.

c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.

This code (and its comment) have been in R's code base since 1998; I'd love to know who originally wrote it (based on comments further down in the code it seems to have been Ross Ihaka?), but I am having trouble following the code's history back beyond a code reorganization in 1998. (A little more digging suggests that this code has been in R's code base essentially since the beginning of its recorded history, i.e. the file was added in SVN revision 2 on 1997-09-18 and not modified until much later.)

Martin Mächler recently (25 Oct 2016) added more information about this to ?qr, so that this information will actually be available in the documentation in the next release of R ...

If you know how to link compiled FORTRAN code with Python code (I don't), it would be pretty easy to compile src/appl/dqrdc2.f and translate the guts of lm.fit into Python. This is the core of lm.fit, minus error-checking and other processing (a rough pure-Python approximation is sketched after the R excerpt below) ...

z <- .Call(C_Cdqrls, x, y, tol, FALSE)
coef <- z$coefficients
pivot <- z$pivot
r1 <- seq_len(z$rank)
dn <- colnames(x)
nmeffects <- c(dn[pivot[r1]], rep.int("", n - z$rank))
r2 <- if (z$rank < p) 
    (z$rank + 1L):p
else integer()
if (is.matrix(y)) { ## ... 
}  else {
    coef[r2] <- NA
    if (z$pivoted) 
        coef[pivot] <- coef
    names(coef) <- dn
    names(z$effects) <- nmeffects
}
z$coefficients <- coef
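
For reference, here is a rough pure-Python approximation of that pivot-and-drop logic (my own sketch, not R's actual algorithm: scipy's column-pivoted QR uses a greedy largest-norm pivoting rule rather than dqrdc2's limited strategy, so the set of dropped columns is not guaranteed to match R's in every case):

import numpy as np
from scipy.linalg import qr, solve_triangular

def lm_fit_pivoted(X, y, tol=1e-7):
    """Least squares via column-pivoted QR: return coefficients for the
    first `rank` pivoted columns and NaN for the columns that would be
    dropped, in the spirit of R's lm.fit / dqrdc2."""
    Q, R, piv = qr(X, mode="economic", pivoting=True)
    # Estimate the numerical rank from the diagonal of R, relative to its
    # leading element (1e-7 loosely mirrors lm.fit's default tol).
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag[0]))
    coef = np.full(X.shape[1], np.nan)
    rhs = Q.T @ y
    coef[piv[:rank]] = solve_triangular(R[:rank, :rank], rhs[:rank])
    return coef

# Hypothetical usage with the question's data already loaded as X (DataFrame) and y (Series):
# D = np.column_stack([np.ones(len(X)), X.values])
# print(lm_fit_pivoted(D, y.values))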
Skirret answered 2/12, 2016 at 16:51 Comment(0)

As a complement to Ben Bolker's answer.

The main problem is what a statistical package should do with ill-posed problems. As far as I have seen, there are large differences across packages in how singular and almost-singular design matrices are handled. The only deterministic way is for the user to explicitly choose a variable selection or penalization algorithm.

If the rank can be clearly identified, then the outcome is still deterministic but varies by the chosen singularity policy. Stata and R drop variables; Stata drops them in sequence as the variables are listed, i.e. it drops the last collinear variables. I don't know which variables R drops. statsmodels uses a symmetric handling of the variables and drops singular values by using a generalized inverse, pinv, based on the singular value decomposition. This corresponds to a tiny penalization, as in PCA reduced-rank regression or ridge regression.
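
In numpy terms this is the minimum-norm least-squares solution; a small illustrative sketch (my own, using a hand-made singular design rather than the question's data):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
X = np.column_stack([A, A[:, 0] + A[:, 1]])  # fourth column is exactly collinear
y = rng.normal(size=6)

# Generalized-inverse solution: no column is dropped; the weight of the
# redundant direction is spread across the collinear columns instead.
beta_pinv = np.linalg.pinv(X) @ y

# SVD-based lstsq returns the same minimum-norm solution.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta_pinv, beta_lstsq))  # True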

If numerical noise affects the identification of the rank or of the collinear variables, however, the result is not deterministic, i.e. it depends on the linear algebra package and might not even be the same on different computers. Specifically, rank-revealing QR and pinv/SVD need thresholds below which the rank is identified as reduced. In statsmodels, the default threshold, taken from numpy, for the relative condition number is around 1e-15, so we get a regularized solution if singular values are smaller than that. But the threshold might be too low in some cases, and then the "non-singular" solution is dominated by numerical noise which cannot be replicated. I guess the same will be true for any rank-revealing QR or other purely numerical solution to the collinearity problem.
(See the robustness issue of statsmodels linear regression (OLS) in Python, https://github.com/statsmodels/statsmodels/issues/2628, and the related question about how exactly statsmodels handles duplicated features.)
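
A toy illustration of the threshold issue (my own example): the detected rank of a nearly collinear matrix depends on the tolerance, which is exactly why different packages, or even different machines, can disagree.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
# Fourth column is a linear combination of the others plus tiny noise.
X = np.column_stack([A, A @ np.array([1.0, -2.0, 0.5]) + 1e-10 * rng.normal(size=50)])

print(np.linalg.matrix_rank(X))            # default tolerance: likely reported as 4
print(np.linalg.matrix_rank(X, tol=1e-6))  # looser tolerance: treated as rank 3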

About rank-revealing, pivoting QR:

It was not available in scipy when most of statsmodels.OLS was written. It has been available in scipy for some time now, via the pivoting keyword, but it has not yet been added as an option to statsmodels.OLS (https://github.com/scipy/scipy/pull/44).
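
For reference, a minimal sketch of that scipy call (my own illustration; this is not something statsmodels.OLS does today): the permutation returned with pivoting=True shows which columns the factorization keeps first and which end up in the near-zero tail.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))
X = np.column_stack([A, A[:, 0] + A[:, 1]])  # last column exactly collinear

Q, R, piv = qr(X, mode="economic", pivoting=True)
print(piv)                 # column order chosen by the pivoting strategy
print(np.abs(np.diag(R)))  # a near-zero trailing diagonal flags the redundant column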

I'm skeptical about it as a default solution because I guess, without ever having verified it, that numerical noise will affect which variables are pivoted; then it would no longer be deterministic which variables are dropped. In my opinion, variable selection should be a conscious choice by the user and not left to a purely numerical criterion.

(disclaimer: I'm a statsmodels maintainer)

edit:

The question uses scikit-learn in the example.

As far as I can see, scikit-learn uses a different LAPACK function to solve the linear regression problem, but it is also based on a singular value decomposition, like statsmodels. However, scikit-learn currently uses the default threshold of the corresponding scipy function, which is smaller than that of the numpy function statsmodels uses.
See, e.g., "How does sklearn do linear regression when p > n?".

So I expect that scikit-learn and statsmodels give the same results in the exactly singular case, but that the results differ in some near-singular cases.
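
If you want to check that expectation on the question's data, here is a minimal sketch (my own; it assumes the X.csv/Y.csv files, and note that scikit-learn centres the data when fit_intercept=True, so the two rank-deficient solutions are not guaranteed to be parameterized identically):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

X = pd.read_csv("X.csv")
y = pd.read_csv("Y.csv")["d1"]

sk = LinearRegression().fit(X, y)             # SVD-based LAPACK least squares (per this answer)
sm_res = sm.OLS(y, sm.add_constant(X)).fit()  # pinv/SVD-based solver

# Both return a finite coefficient for every column (no NAs as in R); any
# differences come from the intercept handling and the different SVD cutoffs.
print(np.max(np.abs(sk.coef_ - sm_res.params.values[1:])))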

Oilcan answered 3/12, 2016 at 18:39 Comment(5)
wouldn't a warning on rank-deficiency or near-rank-deficiency (perhaps with an adjustable tolerance) be a good idea though ...?Skirret
@Ben Bolker The OLS results summary has an additional note to warn about a high condition number or singular values near zero. The condition number is also included in the extra results of the OLS summary. But overall, statsmodels does not warn enough about cases like this and still leaves it up to the user to check suspicious results in too many places.Oilcan
(to continue the comment) One problem is that too many warnings can be annoying in library usage, when it is easier to check an attribute. And for developers it is often difficult to figure out where, when and at which threshold to warn, especially for cases like "maybe you don't want to do this!". Issue for adding "sanity checks": github.com/statsmodels/statsmodels/issues/1908Oilcan
I agree it's a very tough call, and also splits on expected use case: is the library being used for relatively shallow data analysis scripts (in which case you want the warnings), or as the foundation for more complex analytical tools (in which case it's the responsibility of downstream developers to check attributes and handle problem cases appropriately) ?Skirret
Statsmodels has the library usage problem internally. OLS and WLS are reused in many cases (in IRLS for example or for diagnostic tests). In those cases, we would have to catch all warnings if we don't want them to bubble up to users for intermediate calculations.Oilcan
