Remove perfectly multicollinear variables from data frame

Asked 15/2, 2016 at 13:35 Answered 16/2, 2016 at 0:41

I have a data frame in which some variables contain the same information as others:

x1 = runif(1000)
x2 = runif(1000)
x3 = x1 + x2
x4 = runif(1000)
x5 = runif(1000)*0.00000001 +x4
x6 = x5 + x3
x = data.frame(x1, x2, x3, x4, x5, x6)

In the next step I want to drop all variables that are perfectly multicollinear, e.g. columns x3 and x6 (other combinations would work as well).

In Stata this is fairly easy: _rmcoll varlist

How is this efficiently done in R?

EDIT: Note that the ultimate goal is to compute the Mahalanobis distance between observations. For this I need to drop redundant variables. As far as I can foresee, for this application it does not matter whether I drop x1, x2 or x3.

Pluckless answered 15/2, 2016 at 13:35 Comment(5)

Note that if variables (columns) are perfectly collinear, then there's arbitrariness about which is dropped. – Caracara 15/2, 2016 at 13:41

This is what I meant by "there might be also other combinations". In my context, however, it does not matter which ones are dropped. – Atrioventricular 15/2, 2016 at 13:42

For multicollinear data, I would either use principal component regression (see package pls), or some kind of regularized method such as lasso (see package glmnet). – Rebutter 15/2, 2016 at 13:52

Thank you. I am not 100% sure what you are suggesting. Note that I am not aiming to run regressions or anything like that. I added a clarifying comment to my question. – Atrioventricular 15/2, 2016 at 13:55

Note: anyone wanting this thread to move to Cross Validated should note that it was previously posted there and put on hold. (In principle it could be off-topic in both places, but my own view is that it belongs here.) – Caracara 15/2, 2016 at 14:10

I don't know of a built-in convenience function, but QR decomposition will do it.

We need the data frame to be a matrix:

X <- as.matrix(x)

Use a tolerance slightly lower than qr()'s default (1e-9 instead of 1e-7) so that the nearly, but not perfectly, collinear column x5 is kept:

qr.X <- qr(X, tol=1e-9, LAPACK = FALSE)
(rnkX <- qr.X$rank)  ## 4 (number of non-collinear columns)
(keep <- qr.X$pivot[seq_len(rnkX)])
## 1 2 4 5 
X2 <- X[,keep]
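
If this is needed more than once, the same steps can be wrapped in a small function. This is only a sketch: the name drop_collinear and its tol argument are made up for illustration, the tolerance of 1e-9 is carried over from the call above, and set.seed() is added purely so the example is reproducible.

```r
# Hypothetical helper (name and arguments are illustrative, not from a package):
# keep only the columns that the pivoted QR treats as linearly independent.
drop_collinear <- function(df, tol = 1e-9) {
  X <- as.matrix(df)
  qr_X <- qr(X, tol = tol, LAPACK = FALSE)    # LINPACK QR, which reports a rank
  keep <- qr_X$pivot[seq_len(qr_X$rank)]      # indices of the retained columns
  df[, keep, drop = FALSE]
}

# Same construction as in the question; the seed is an assumption for reproducibility.
set.seed(1)
x1 <- runif(1000)
x2 <- runif(1000)
x3 <- x1 + x2
x4 <- runif(1000)
x5 <- runif(1000) * 0.00000001 + x4
x6 <- x5 + x3
x <- data.frame(x1, x2, x3, x4, x5, x6)

reduced <- drop_collinear(x)
names(reduced)
```

Which member of a collinear set survives depends on the column order, because qr() only moves columns it finds near-dependent to the end.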

This strictly answers your question; you might also be able to use singular value decomposition (svd()) to implement Mahalanobis distances directly on this type of data ...
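
A rough sketch of what the SVD route could look like, under stated assumptions: mahalanobis_scores is a hypothetical helper, and the relative cutoff tol = 1e-12 is an arbitrary choice that decides which directions count as numerically null. The idea is that for centred data Xc = U D V', the covariance is V D^2 V' / (n - 1), so distances based on its pseudo-inverse reduce to Euclidean distances between rows of U scaled by sqrt(n - 1).

```r
# Hypothetical helper: returns "whitened" scores Z such that Euclidean distances
# between rows of Z are Mahalanobis distances based on the pseudo-inverse of cov(X).
mahalanobis_scores <- function(X, tol = 1e-12) {
  X <- as.matrix(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
  s <- svd(Xc)
  keep <- s$d > tol * s$d[1]                     # drop numerically null directions
  s$u[, keep, drop = FALSE] * sqrt(nrow(X) - 1)
}

# Same construction as in the question; the seed is an assumption for reproducibility.
set.seed(1)
x1 <- runif(1000)
x2 <- runif(1000)
x3 <- x1 + x2
x4 <- runif(1000)
x5 <- runif(1000) * 0.00000001 + x4
x6 <- x5 + x3
x <- data.frame(x1, x2, x3, x4, x5, x6)

Z <- mahalanobis_scores(x)
d2_centre <- rowSums(Z^2)   # squared distance of each row from the column means
D <- dist(Z[1:5, ])         # pairwise distances among the first five observations
```

No columns have to be dropped by hand here; redundant columns only add directions with (numerically) zero singular values, which the cutoff removes. For a full-rank matrix the squared distances from the centre should agree with mahalanobis(X, colMeans(X), cov(X)).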

Rosenfeld answered 15/2, 2016 at 14:9 Comment(0)

For completeness I post the quick-and-dirty solution I was using until now. I actually think it does not perform that badly compared to other methods.

x1 = runif(1000)
x2 = runif(1000)
x3 = x1 + x2
x4 = runif(1000)
x5 = runif(1000)*0.00000001 +x4
x6 = x5 + x3
x = data.frame(x1, x2, x3, x4, x5, x6)

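const = rep(1, 1000)              # dummy response; only the design matrix matters
a <- lm(const ~ ., data = x)      # lm() sets the coefficients of aliased columns to NA
names(a$coefficients[!is.na(a$coefficients)])[-1]   # retained variables, intercept dropped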
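
One caveat with this approach: lm() runs its own pivoted QR with a default tolerance of 1e-7, so a nearly collinear column such as x5 will typically be treated as aliased too. The sketch below assumes that a tol argument given to lm() is passed through to lm.fit(); kept_by_lm is a hypothetical helper name, and the seed is added only for reproducibility.

```r
# Hypothetical helper: names of the columns that lm() does not flag as aliased.
kept_by_lm <- function(df, tol = 1e-7) {
  const <- rep(1, nrow(df))                     # dummy response
  fit <- lm(const ~ ., data = df, tol = tol)    # tol is forwarded to lm.fit()
  b <- coef(fit)
  setdiff(names(b)[!is.na(b)], "(Intercept)")
}

# Same construction as above; the seed is an assumption for reproducibility.
set.seed(1)
x1 <- runif(1000)
x2 <- runif(1000)
x3 <- x1 + x2
x4 <- runif(1000)
x5 <- runif(1000) * 0.00000001 + x4
x6 <- x5 + x3
x <- data.frame(x1, x2, x3, x4, x5, x6)

kept_default <- kept_by_lm(x)               # default tolerance of lm.fit()
kept_strict  <- kept_by_lm(x, tol = 1e-9)   # tighter tolerance
```

Comparing kept_default with kept_strict shows whether the tolerance changes which columns are kept for a given data set.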

Pluckless answered 16/2, 2016 at 0:41 Comment(0)
