Screening (multi)collinearity in a regression model

I hope this one is not going to be an "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of individual predictors.

One way to spot collinearity is to take each predictor in turn as the dependent variable, with the other predictors as independent variables, determine R^2, and, if it's larger than .9 (or .95), consider that predictor redundant. This is one "method"... what about other approaches? Some of them are time-consuming, like excluding predictors from the model one at a time and watching for changes in the b-coefficients - they should change noticeably.
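
For illustration, a rough sketch of this R^2 screening (the helper name screen_r2 and the 0.9 cutoff are just placeholders, not part of the original question):

screen_r2 <- function(X, cutoff = 0.9) {
  # regress each predictor on all the others and report R^2;
  # values above the cutoff suggest the predictor is redundant
  X <- as.matrix(X)
  r2 <- sapply(seq_len(ncol(X)), function(i)
    summary(lm(X[, i] ~ X[, -i]))$r.squared)
  setNames(r2, colnames(X))
}

set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$x3 <- d$x1 + 2 * d$x2 + rnorm(50, sd = 0.01)  # x3 is nearly a linear combination of x1 and x2
screen_r2(d)  # all three predictors will show R^2 close to 1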

Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in various ways of screening for redundant predictors when (multi)collinearity occurs in a regression model.

Sink answered 15/6, 2010 at 2:10 Comment(13)
I'm very pleased that no one marked this as not "programmy" enough and that many people upvoted it. This is a very good question that many of us who "program with data" struggle with.Mottle
Great question, and wonderful answers. A very rewarding read - thank you.Seifert
Credit should go to a friend of mine... she asked me about collinearity, and after searching SO I figured out that there were no questions about it... which was quite odd, since the collinearity problem is commonplace in statistical analysis. Thank you, lads, for these great answers!Sink
Great stuff guys, I really appreciate seeing an R community spring up here at SO.Watchband
@dmckee: Then I think it's R that's off-topic at SO. R is an environment for statistical computing, data analysis and graphics. Thankfully it's not for boring tasks like application development. Calling functions by typing their names rather than using a GUI doesn't make me feel like a programmer in any way.Dragline
@dmckee: the same argument could be made for almost all algorithm questions, as they are simply NOT programming since they do not deal with a specific language's implementation of an algorithm. They are logic and therefore should not be included in Stack Overflow. That kinda sounds silly, doesn't it?Mottle
@Brani, @JD Long: I do nuclear physics analysis in ROOT, so I appreciate the value of good tools, and don't object to R questions on SO; but the question is still "What are some methods in statistics for identifying collinearity and multi-collinearity?"Rutilant
...and it happens that we do statistics in R. =) It's not explicitly stated in the question title, but the question is tagged as R-specific... And, in the end, the answers matter the most. Thank you once again for these great answers!Sink
what really matters is what the [r] tag community in stackoverflow thinks is appropriate. And this is one of the highest rated R questions ever. QEDMottle
JD, I'm glad 'bout the first one, and still cannot believe 'bout the second one... O_oSink
@al3xa as of this moment, this question is rated #16 out of 1306 questions tagged [r]. I believe this is evidence that there is a desire for applied analytical questions using [r].Mottle
@JD, I couldn't agree with you more! I hope that questions like this will appear more often...Sink
This question appears to be off-topic because it is about statistical practice. It should be migrated to CrossValidated (which didn't exist when the question was originally asked ...)Swatch

The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7

and we go further by making the third regressor more and more collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16
> 

This uses approximations; see help(kappa) for details.

Tarpan answered 15/6, 2010 at 2:58 Comment(0)

Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from condition number, include:

1) the determinant of the correlation matrix of the predictors, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The determinant of the covariance matrix, used below, is not capped at 1 in general, but values close to 0 still point to collinearity.

# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09

2) Using the fact that the determinant of a matrix equals the product of its eigenvalues => the presence of one or more eigenvalues close to zero indicates collinearity

> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184

> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09

3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF of at least one predictor is large; the rule of thumb is that VIF > 10 is of concern. For an implementation in R see here. I would also add that the use of R^2 for detecting collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does exist.
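
A minimal hand-rolled sketch of that formula, reusing Dirk's simulated regressors from above (the helper name vif_by_hand is just for illustration; car::vif on a fitted lm does the same job):

vif_by_hand <- function(X) {
  # VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing predictor i
  # on the remaining predictors
  X <- as.matrix(X)
  vifs <- sapply(seq_len(ncol(X)), function(i) {
    r2 <- summary(lm(X[, i] ~ X[, -i]))$r.squared
    1 / (1 - r2)
  })
  setNames(vifs, colnames(X))
}

vif_by_hand(mm12[, -1])   # VIFs close to 1 for the well-behaved model
vif_by_hand(mm123[, -1])  # huge VIFs (far above 10) for the nearly collinear one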

Symer answered 15/6, 2010 at 8:23 Comment(3)
Thanks Γιώργος, +2 for this one! Great answer!Sink
why would the determinant of the covariance matrix be capped at 1??Impassion
blog.exploratory.io/…Symer

You might like Vito Ricci's Reference Card "R Functions For Regression Analysis" http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It succinctly lists many useful regression-related functions in R, including diagnostic functions. In particular, it lists the vif function from the car package, which can assess multicollinearity. http://en.wikipedia.org/wiki/Variance_inflation_factor

Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
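
For completeness, a small usage sketch of the two packages mentioned above, on a made-up model (the data are purely illustrative; vif() and calc.relimp() are the documented entry points of car and relaimpo respectively):

library(car)        # for vif()
library(relaimpo)   # for calc.relimp()

set.seed(7)
dat <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
dat$x3 <- dat$x1 + 2 * dat$x2 + rnorm(80, sd = 0.1)  # x3 nearly collinear with x1, x2
dat$y  <- 1 + dat$x1 - dat$x2 + rnorm(80)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

vif(fit)                        # variance inflation factors; > 10 is a common warning threshold
calc.relimp(fit, type = "lmg")  # relative importance of each predictor (lmg metric)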

Donatus answered 15/6, 2010 at 9:8 Comment(1)
Technically, and arithmetically, VIF = 1/(1 - R^2), where R^2 refers to the example I stated in my question. I forgot to mention VIF, so thanks for helping on this one! relaimpo is a great find!Sink

See also Section 9.4 of this book: Practical Regression and Anova using R [Faraway 2002].

Collinearity can be detected in several ways:

  1. Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.

  2. A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.

  3. Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number can be shown to be the ratio of the largest to the smallest non-zero singular value of X, i.e. $\kappa = \sqrt{\lambda_1/\lambda_p}$ in terms of the eigenvalues of t(X) %*% X (see ?kappa); $\kappa \geq 30$ is considered large. A short sketch follows below.
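
A sketch of point 3 on a toy model matrix (purely illustrative; see ?kappa):

set.seed(42)
x1 <- rnorm(100); x2 <- rnorm(100)
x3 <- x1 + 2 * x2 + rnorm(100) * 1e-4     # nearly a linear combination of x1 and x2
X  <- model.matrix(~ x1 + x2 + x3)

ev <- eigen(crossprod(X))$values          # eigenvalues of t(X) %*% X; one is tiny
sqrt(max(ev) / min(ev))                   # 2-norm condition number; >= 30 is considered large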

Inhuman answered 15/6, 2010 at 7:50 Comment(1)
link-only answers are deprecated on SOSwatch

To add a bit more on the VIF: a Variance Inflation Factor > 10 usually indicates serious redundancy between predictor variables. The VIF of a variable is the factor by which the variance of its coefficient is inflated compared to what it would be if the variable were uncorrelated with the other predictors.

vif() is available in the car package and is applied to an object of class lm. It returns the VIF for each predictor x1, x2, ..., xn in the fitted model. It is a good idea to exclude variables with VIF > 10, or to apply transformations to the variables with VIF > 10.

Pitcher answered 25/7, 2014 at 20:50 Comment(0)
