List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

Asked 22/7, 2013 at 0:31 Answered 23/12, 2023 at 19:18

151

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data set in Python or R), but I am wondering how to do it with pandas? In my case the matrix is 4460x4460, so can't do it visually.

Topflight answered 22/7, 2013 at 0:31 Comment(0)

135

You can use DataFrame.values to get an numpy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.

But if you want to do this in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print so[-4470:-4460]

Here is the output:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64

Caboodle answered 22/7, 2013 at 1:43 Comment(4)

With Pandas v 0.17.0 and higher you should use sort_values instead of order. You will get an error if you try using the order method. – Garish 5/9, 2017 at 16:6

Also, in order to get the highly correlated pairs, you need to use sort_values(ascending=False). – Graehl 30/12, 2020 at 11:43

"numpy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs." - could you show an example of this too? – Bloodshot 28/2, 2021 at 20:50

All these solutions using sort_values need a by argument now. But what to use if I want to sort all dataframe values, not just one column? – Firecure 28/4, 2022 at 18:6

@HYRY's answer is perfect. Just building on that answer by adding a bit more logic to avoid duplicate and self correlations and proper sorting:

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6], 
     'x2': [0, 0, 8, 2, 4], 
     'x3': [2, 8, 8, 10, 12], 
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

That gives the following output:

Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

Beersheba answered 3/1, 2017 at 23:15 Comment(2)

instead of get_redundant_pairs(df), you can use "cor.loc[:,:] = np.tril(cor.values, k=-1)" and then "cor = cor[cor>0]" – Harriot 21/3, 2017 at 5:59

I'm getting erro for line au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False) : # -- partial selection or non-unique index – Cathicathie 6/7, 2018 at 12:46

Few lines solution without redundant pairs of variables:

corr_matrix = df.corr().abs()

#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)

sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))

#first element of sol series is the pair with the biggest correlation

Then you can iterate through names of variables pairs (which are pandas.Series multi-indexes) and theirs values like this:

for index, value in sol.items():
  # do some staff

Maidinwaiting answered 28/3, 2017 at 15:30 Comment(7)

probably a bad idea to use os as a variable name because it masks the os from import os if available in the code – Dis 29/8, 2018 at 7:29

Thanks for your suggestion, i changed this unproper var name. – Maidinwaiting 23/10, 2018 at 9:33

as of 2018 use sort_values(ascending=False) instead of order – Feer 26/11, 2018 at 21:10

how to loop 'sol'?? – Gull 1/7, 2020 at 19:29

@Gull I placed an answer to your question above – Maidinwaiting 2/7, 2020 at 18:39

In 2022. Getting warning using the above code . DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here. Deprecated in NumPy 1.20; for more details and guidance: numpy.org/devdocs/release/1.20.0-notes.html#deprecations """ – Fango 18/4, 2022 at 2:54

@Fango thanks for letting know, changed it – Maidinwaiting 26/1, 2023 at 14:37

Combining some features of @HYRY and @arun's answers, you can print the top correlations for dataframe df in a single line using:

df.corr().unstack().sort_values().drop_duplicates()

Note: the one downside is if you have 1.0 correlations that are not one variable to itself, the drop_duplicates() addition would remove them

Misesteem answered 27/6, 2018 at 21:24 Comment(4)

Wouldn't drop_duplicates drop all correlations that are equal? – Dis 29/8, 2018 at 7:30

@shadi yes, you are correct. However, we assume the only correlations which will be identically equal are correlations of 1.0 (i.e. a variable with itself). Chances are that the correlation for two unique pairs of variables (i.e. v1 to v2 and v3 to v4) would not be exactly the same – Misesteem 29/8, 2018 at 21:42

Definitely my favoirite, simplicity itself. in my usage, I filtered first for high corrleations – Multiform 22/8, 2020 at 12:18

This is good. I would add .sort_values(ascending = False) to improve visibility – Trieste 29/10, 2022 at 14:22

I liked Addison Klinke's post the most, as being the simplest, but used Wojciech Moszczyńsk’s suggestion for filtering and charting, but extended the filter to avoid absolute values, so given a large correlation matrix, filter it, chart it, and then flatten it:

Created, Filtered and Charted

dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

Function

In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended, e.g., asymmetric upper and lower bounds, etc.

def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(df, .7)

Follow-Up

Eventually, I refined the functions

# Returns correlation matrix
def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    return xFiltered

# flattens correlation matrix with bounds
def corrFilterFlattened(x: pd.DataFrame, bound: float):
    xFiltered = corrFilter(x, bound)
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

# Returns correlation for a variable from flattened correlation matrix
def filterForLabels(df: pd.DataFrame, label):  
    try:
        sideLeft = df[label,]
    except:
        sideLeft = pd.DataFrame()

    try:
        sideRight = df[:,label]
    except:
        sideRight = pd.DataFrame()

    if sideLeft.empty and sideRight.empty:
        return pd.DataFrame()
    elif sideLeft.empty:        
        concat = sideRight.to_frame()
        concat.rename(columns={0:'Corr'},inplace=True)
        return concat
    elif sideRight.empty:
        concat = sideLeft.to_frame()
        concat.rename(columns={0:'Corr'},inplace=True)
        return concat
    else:
        concat = pd.concat([sideLeft,sideRight], axis=1)
        concat["Corr"] = concat[0].fillna(0) + concat[1].fillna(0)
        concat.drop(columns=[0,1], inplace=True)
        return concat

Multiform answered 22/8, 2020 at 12:51 Comment(5)

how to remove the very last one? HofstederPowerDx and Hofsteder PowerDx are the same variables, right? – Brittne 20/10, 2020 at 9:29

one can use .dropna() in the functions. I just tried it in VS Code and it works, where I use the first equation to create and filter the correlation matrix, and another to flatten it. If you use that, you might want to experiment with removing .dropduplicates() to see whether you need both .dropna() and dropduplicates(). – Multiform 21/10, 2020 at 13:16

A notebook that includes this code and some other improvements is here: github.com/JamesIgoe/GoogleFitAnalysis – Multiform 21/10, 2020 at 13:39

I believe the code is summing up the r value twice here, please correct if I am wrong, – Congdon 17/4, 2021 at 5:52

@Sidrah - I did some basic spot checking and it seems to be accurate, but if you've tried to use it and it is doubling fro you, let me know. – Multiform 18/4, 2021 at 18:4

You can do graphically according to this simple code by substituting your data.

corr = df.corr()

kot = corr[corr>=.9]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Greens")

Fatal answered 27/3, 2020 at 8:16 Comment(1)

Would I want something like kot = corr[abs(corr) >= 0.9] in case of strong negative correlations too? – Bloodshot 28/2, 2021 at 20:42

Use the code below to view the correlations in the descending order.

# See the correlations in descending order

corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)

Halftruth answered 7/4, 2018 at 18:18 Comment(2)

Your 2nd line should be: c1 = core.abs().unstack() – Staley 20/12, 2018 at 21:23

or first line corr = df.corr() – Push 13/2, 2019 at 16:22

Combining most the answers above into a short snippet:

def top_entries(df):
    mat = df.corr().abs()
    
    # Remove duplicate and identity entries
    mat.loc[:,:] = np.tril(mat.values, k=-1)
    mat = mat[mat>0]

    # Unstack, sort ascending, and reset the index, so features are in columns
    # instead of indexes (allowing e.g. a pretty print in Jupyter).
    # Also rename these it for good measure.
    return (mat.unstack()
             .sort_values(ascending=False)
             .reset_index()
             .rename(columns={
                 "level_0": "feature_a",
                 "level_1": "feature_b",
                 0: "correlation"
             }))

Mauromaurois answered 8/2, 2021 at 20:56 Comment(0)

Lot's of good answers here. The easiest way I found was a combination of some of the answers above.

corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(np.bool))
corr = corr.unstack().transpose()\
    .sort_values(by='column', ascending=False)\
    .dropna()

Littrell answered 10/3, 2019 at 2:16 Comment(0)

The following function should do the trick. This implementation

Removes self correlations
Removes duplicates
Enables the selection of top N highest correlated features

and it is also configurable so that you can keep both the self correlations as well as the duplicates. You can also to report as many feature pairs as you wish.

def get_feature_correlation(df, top_n=None, corr_method='spearman',
                            remove_duplicates=True, remove_self_correlations=True):
    """
    Compute the feature correlation and sort feature pairs based on their correlation

    :param df: The dataframe with the predictor variables
    :type df: pandas.core.frame.DataFrame
    :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
    :param corr_method: Correlation compuation method
    :type corr_method: str
    :param remove_duplicates: Indicates whether duplicate features must be removed
    :type remove_duplicates: bool
    :param remove_self_correlations: Indicates whether self correlations will be removed
    :type remove_self_correlations: bool

    :return: pandas.core.frame.DataFrame
    """
    corr_matrix_abs = df.corr(method=corr_method).abs()
    corr_matrix_abs_us = corr_matrix_abs.unstack()
    sorted_correlated_features = corr_matrix_abs_us \
        .sort_values(kind="quicksort", ascending=False) \
        .reset_index()

    # Remove comparisons of the same feature
    if remove_self_correlations:
        sorted_correlated_features = sorted_correlated_features[
            (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
        ]

    # Remove duplicates
    if remove_duplicates:
        sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]

    # Create meaningful names for the columns
    sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']

    if top_n:
        return sorted_correlated_features[:top_n]

    return sorted_correlated_features

Lands answered 1/4, 2020 at 17:12 Comment(1)

Maybe make abs() a parameter. Other than that, excellent function. Very reusable and well documented. – Burbot 27/2, 2022 at 11:12

Use itertools.combinations to get all unique correlations from pandas own correlation matrix .corr(), generate list of lists and feed it back into a DataFrame in order to use '.sort_values'. Set ascending = True to display lowest correlations on top

corrank takes a DataFrame as argument because it requires .corr().

  def corrank(X: pandas.DataFrame):
        import itertools
        df = pd.DataFrame([[(i,j),X.corr().loc[i,j]] for i,j in list(itertools.combinations(X.corr(), 2))],columns=['pairs','corr'])    
        print(df.sort_values(by='corr',ascending=False))

  corrank(X) # prints a descending list of correlation pair (Max on top)

Thrombocyte answered 22/9, 2017 at 10:46 Comment(1)

While this code snippet may be the solution, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Labyrinth 22/9, 2017 at 11:6

I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.

So I ended up with the following simplified solution:

# map features to their absolute correlation values
corr = features.corr().abs()

# set equality (self correlation) as zero
corr[corr == 1] = 0

# of each feature, find the max correlation
# and sort the resulting array in ascending order
corr_cols = corr.max().sort_values(ascending=False)

# display the highly correlated features
display(corr_cols[corr_cols > 0.8])

In this case, if you want to drop correlated features, you may map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.

Adventurism answered 7/10, 2019 at 16:3 Comment(2)

This just gives one index (feature) and not something like feature1 feature2 0.98. Change linecorr_cols = corr.max().sort_values(ascending=False) to corr_cols = corr.unstack() – Asha 8/10, 2019 at 20:17

Well the OP did not specify a correlation shape. As I mentioned, I didn't want to unstack, so I just brought a different approach. Each correlation pair is represented by 2 rows, in my suggested code. But thanks for the helpful comment! – Adventurism 9/10, 2019 at 23:10

This is a improve code from @MiFi. This one order in abs but not excluding the negative values.

   def top_correlation (df,n):
    corr_matrix = df.corr()
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
                 .stack()
                 .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns=["Variable_1","Variable_2","Correlacion"]
    correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"],axis=1)
    return correlation.head(n)

top_correlation(ANYDATA,10)

Neurophysiology answered 23/1, 2020 at 12:8 Comment(0)

I was trying some of the solutions here but then I actually came up with my own one. I hope this might be useful for the next one so I share it here:

def sort_correlation_matrix(correlation_matrix):
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

Berwickupontweed answered 16/10, 2019 at 13:25 Comment(0)

simple is better

from collections import defaultdict
res = defaultdict(dict)
corr = returns.corr().replace(1, -1)
names = list(corr)

for name in names:
    idx = corr[name].argmax()
    max_pairwise_name = names[idx]
    res[name][max_pairwise_name] = corr.loc[max_pairwisename, name]

Now res contains the maximum pairwise correlation for each pair

Buddhology answered 1/12, 2022 at 3:28 Comment(0)

I was doing something similar to this but I wanted to change the threshold of the Top Correlated Values for different dataframes. So I built this function combining some of the solutions above and on a few other questions.

def get_top_correlations(df, abs_bound=0.8, n=None):
        
    '''    
    Function to remove self-correlations and get the upper correlated pairs based on threshold entered.

    Parameters
    ----------
    Enter Dataframe for correlation. 
    Enter Threshold between 0 and 1 - the function will look at both positive and negative of the value entered
    Enter the number of rows to be returned by df.head - optional

    Returns
    -------
    Sorted Pairs of +/- Correlations in a dataframe

    '''
    

    corr = df.corr()    # Run the Correlation Matrix on the Dataframe
    np.fill_diagonal(corr.values, np.nan)   # Replace the self-correlated values with nan
    
    # Create a mask to hide the duplicates on the upper triangle of the matrix
    mask = np.triu(np.ones_like(corr, dtype=bool))  # StackOverflow 69463800
    
    # Apply the mask, filter the matrix based on the positive and negative input bound
    mask_corr = corr.where(mask).stack().reset_index()  # StackOverflow  58359907 & index
    filtered_corr = mask_corr[(mask_corr[0] >= abs_bound) | (mask_corr[0] <= -abs_bound)] # Maintaining the + / - values

    # Label & Sort the Columns
    filtered_corr.columns = ['Row Stat', 'Column Stat', 'Correlation']
    sort_corr = filtered_corr.sort_values(by='Correlation', ascending=False)

    print(sort_corr.head(n))
    print(sort_corr.tail(n))

    return sort_corr

Gaspar answered 23/12, 2023 at 19:18 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags