Find the pair of most correlated variables

Suppose I have a data frame consisting of 20 columns (variables), all of them numeric. I can use the cor function in R to get the correlation coefficients in matrix form, or visualize the correlation matrix (with the correlation coefficients labeled as well). Suppose I just want to sort the variable pairs by their correlation coefficient. How can I do this in R?

Titrate answered 19/9, 2017 at 19:23 Comment(0)

Solution using corrr:

corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations.

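library(corrr)
library(dplyr)  # provides arrange() and the %>% pipe

matrix(rnorm(100), 5) %>%
    correlate() %>%   # correlation data frame
    stretch() %>%     # long format: x, y, r
    arrange(r)        # sort by correlation, ascending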
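
If each pair should appear only once, a possible variant (a sketch, assuming corrr's shave() and the na.rm argument of stretch() behave as documented) blanks out the upper triangle before stretching. The data frame name df and the seed are only for illustration:

```r
library(corrr)
library(dplyr)

set.seed(1)                                # illustrative seed, not from the answer
df <- as.data.frame(matrix(rnorm(100), 5)) # 20 columns named V1..V20

pairs_once <- df %>%
    correlate() %>%            # diagonal is NA by default
    shave() %>%                # set the upper triangle to NA
    stretch(na.rm = TRUE) %>%  # long format, dropping the NA rows
    arrange(desc(r))           # strongest positive correlation first

head(pairs_once)
```

With 20 variables this should leave choose(20, 2) = 190 rows, one per unordered pair.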

Solution using reshape2 & data.table:

You can reshape the cor result to long form with reshape2::melt (imported with data.table) and then order (sort) the rows by correlation value.

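library(data.table)
corMatrix <- cor(matrix(rnorm(100), 5))
# Call reshape2::melt explicitly to reshape the matrix into long form (Var1, Var2, value)
setDT(reshape2::melt(corMatrix))[order(value)]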
Limacine answered 19/9, 2017 at 19:27 Comment(2)
@Frank Thank you! Didn't know that: if (is.data.table(data)) {UseMethod("melt", data)} else {reshape2::melt} - Limacine
I don't think OP is interested in the correlation of V1 with V1, which is obviously 1. So I suggest changing the last line to setDT(melt(corMatrix))[Var1 != Var2][order(value)] - Ebonee
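
Following the comment above about dropping self-correlations, here is a minimal base R sketch that also keeps each pair only once by reading just the upper triangle of the matrix. The names idx and pairs, the V1..V20 labels, and the seed are illustrative assumptions, not part of the original answer:

```r
set.seed(1)                                # illustrative seed
corMatrix <- cor(matrix(rnorm(100), 5))    # 20 columns -> 20 x 20 matrix
# matrix() gives no column names, so label the variables V1..V20
dimnames(corMatrix) <- list(paste0("V", 1:20), paste0("V", 1:20))

# upper.tri() selects every unordered pair once and skips the diagonal
idx <- which(upper.tri(corMatrix), arr.ind = TRUE)
pairs <- data.frame(
  Var1  = rownames(corMatrix)[idx[, "row"]],
  Var2  = colnames(corMatrix)[idx[, "col"]],
  value = corMatrix[idx]                   # matrix indexing by (row, col)
)

# sort from the most positive to the most negative correlation
pairs <- pairs[order(pairs$value, decreasing = TRUE), ]
head(pairs)
```

To rank by strength regardless of sign, order by abs(pairs$value) instead.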

dplyr + tidyr solution:

set.seed(123)
mat = matrix(rnorm(50), nrow = 10, ncol = 5)
colnames(mat) = paste0("X", 1:5)

library(dplyr)
library(tidyr)

cor(mat) %>%
  as.data.frame() %>%
  mutate(var1 = rownames(.)) %>%
  gather(var2, value, -var1) %>%
  arrange(desc(value))

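Since correlation matrices are symmetric (cor(X1, X2) == cor(X2, X1)), each pair appears twice with the same value. We can group_by the value column and keep only the first row of each group to remove the duplicates. Note that this also collapses all the diagonal entries (value 1) into a single row: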

cor(mat) %>%
  as.data.frame() %>%
  mutate(var1 = rownames(.)) %>%
  gather(var2, value, -var1) %>%
  arrange(desc(value)) %>%
  group_by(value) %>%
  filter(row_number()==1)

Result:

# A tibble: 11 x 3
# Groups:   value [11]
    var1  var2       value
   <chr> <chr>       <dbl>
 1    X1    X1  1.00000000
 2    X4    X1  0.67301956
 3    X2    X1  0.57761512
 4    X4    X2  0.27131880
 5    X5    X4  0.07453706
 6    X5    X3  0.02265933
 7    X5    X2 -0.25201740
 8    X5    X1 -0.34863673
 9    X3    X1 -0.40595930
10    X4    X3 -0.43726491
11    X3    X2 -0.56734869
Polymer answered 19/9, 2017 at 19:34 Comment(2)
Don't we want to exclude the correlation of a variable with itself, i.e. row 1 in the output (var1=X1, var2=X1)? - Bobette
@Bobette It's up to you whether to include it, as long as you know what it means. OP hasn't provided expected output, so I'm just leaving it here. - Polymer
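
If only the single most correlated pair is needed, as in the question title, sorting is not required. A minimal base R sketch, reusing the mat from this answer and assuming "most correlated" means largest absolute correlation (the names cm, pos and top are illustrative):

```r
set.seed(123)
mat <- matrix(rnorm(50), nrow = 10, ncol = 5)
colnames(mat) <- paste0("X", 1:5)

cm <- cor(mat)
diag(cm) <- NA   # ignore the correlation of each variable with itself

# which.max() skips NA; arrayInd() turns the linear index into (row, col)
pos <- arrayInd(which.max(abs(cm)), dim(cm))
top <- data.frame(
  var1  = rownames(cm)[pos[1, 1]],
  var2  = colnames(cm)[pos[1, 2]],
  value = cm[pos]
)
top
```

Dropping abs() would instead pick the pair with the largest positive correlation.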
