Pandas dataframe divide features to group of high correlation

Asked 19/10, 2020 at 9:34 Answered 1/5, 2022 at 14:32

Solved pandas dataframe data-science feature-selection yellowbrick

I have a dataframe with over 280 features. I ran correlation map to detect groups of features that are highly correlated: Now, I want to divide the features to groups, such that each group will be a "red zone", meaning each group will have features that are all have correlation >0.5 with each other.

How can it be done?

Thanks

Remiss answered 19/10, 2020 at 9:34 Comment(2)

I think it may be possible for the groups to intersect. Say, feature [A,B,C] form a red group, [C,D,E] form a red group but [A,B,C,D,E] do not. How do you deal with such situation? – Plataea 19/10, 2020 at 11:5

@BillHuang I want that all the features will be "red" with other, within each group. so in the case you suggested - one group is [A,B,C] and one group is [C,D,E] – Remiss 19/10, 2020 at 11:44

Disclaimer:

Visualization is not addressed in this solution. Only groups were found.
The solution is known to be NP-hard, so mind efficiency problems.

Theory

The problem is essentially a clique problem in graph theory, which means finding all the complete subgraphs in a given graph (with nodes > 2).

Imagine a graph that all the features are nodes and pairs of features satisfying corr > 0.5 are edges. Then the task of finding all "groups" requested can simply translates into "finding all complete subgraphs in the graph".

Code

The code uses networkx.algorithms.find_cliques for the search task, which implements Bron–Kerbosch algorithm according to the docs.

The code conprises of two parts. The first part extract the edges using np.triu (modified from this post) and the second part feeds the edge list into networkx.

The Coorelation Matrix

Feature [A,B,C] and [C,D,E] are closely correlated respectively, but not between [A,B] and [D,E].

np.random.seed(111)  # reproducibility
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
a = x
b = x + np.random.normal(0, .5, 100)
c = x + y
d = y + np.random.normal(0, .5, 100)
e = y + np.random.normal(0, .5, 100)

df = pd.DataFrame({"A":a, "B":b, "C":c, "D":d, "E":e})
corr = df.corr()

corr
Out[24]: 
          A         B         C         D         E
A  1.000000  0.893366  0.677333 -0.078369 -0.090510
B  0.893366  1.000000  0.577459 -0.072025 -0.079855
C  0.677333  0.577459  1.000000  0.587695  0.579891
D -0.078369 -0.072025  0.587695  1.000000  0.777803
E -0.090510 -0.079855  0.579891  0.777803  1.000000

Part 1

# keep only upper triangle elements (excluding diagonal elements)
mask_keep = np.triu(np.ones(corr.shape), k=1).astype('bool').reshape(corr.size)
# melt (unpivot) the dataframe and apply mask
sr = corr.stack()[mask_keep]
# filter and get names
edges = sr[sr > 0.5].reset_index().values[:, :2]

edges
Out[25]: 
array([['A', 'B'],
       ['A', 'C'],
       ['B', 'C'],
       ['C', 'D'],
       ['C', 'E'],
       ['D', 'E']], dtype=object)

Part 2

import networkx as nx
g = nx.from_edgelist(edges)
ls_cliques = []
for clique in nx.algorithms.find_cliques(g):
    ls_cliques.append(clique)

# result
ls_cliques
Out[26]: [['C', 'A', 'B'], ['C', 'D', 'E']]

Plataea answered 19/10, 2020 at 13:1 Comment(3)

Thanks, but I get an error: In the line sr = corr.stack()[mask_keep] , I get: IndexError: Item wrong length 97344 instead of 78400. The shape of mask_keep is (97344,), and corr is (312, 312). However, the length of corr_mat.stack() is 78400 which causes the error – Remiss 19/10, 2020 at 13:46

Please clear your entire work space and check your variable names carefully. As indicated by the fact that 78400=280^2, you are likely feeding the new mask onto old data. – Plataea 19/10, 2020 at 13:55

It is also possible that some of the 312 features were not present in the corr matrix (e.g. categorical features). Please remove these features and retry. – Plataea 19/10, 2020 at 13:57

I had the same issue here: the length of the stacked correlation matrix differed from that of the mask. What worked for me was to keep the NaNs while stacking as follows:

sr = corr.stack([dropna=False][1])[mask_keep]

@billhuang correctly states reasons why this could happen.

Sillabub answered 1/5, 2022 at 14:32 Comment(0)

Theory

Code

Recommended topics

Hot tags