Calculate correlation between columns of strings

Asked 9/7, 2018 at 8:55 Answered 13/11, 2023 at 17:17

Solved python python-3.x pandas string correlation

I've got a df that contains the columns profession and media. I would like to calculate the correlation between those two columns.

Is there a short hack for calculating the correlation between columns of strings? Or do I have to transform each profession and media to a number and then calculate the correlation with .corr()?

I found a similar question (Is there a way to get correlation with string data and a numerical value in pandas?) but I would like to check the string, not each word within the string.

df

  profession        media
0 media lawyer      print
1 student           online
2 student           print
3 professor         online
4 media lawyer      online

Tremulant answered 9/7, 2018 at 8:55 Comment(0)

You can convert each column to the categorical dtype and then compute the correlation on the integer category codes:

df['profession']=df['profession'].astype('category').cat.codes
df['media']=df['media'].astype('category').cat.codes
df.corr()
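For reference, here is a minimal sketch of what `.cat.codes` returns on the data from the question. The DataFrame is rebuilt by hand from the table in the question, and `codes` is only an illustrative name. It encodes a copy instead of overwriting the string columns, which is a small variation on the snippet above rather than the same code.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "profession": ["media lawyer", "student", "student", "professor", "media lawyer"],
    "media": ["print", "online", "print", "online", "online"],
})

profession = df["profession"].astype("category")

# Categories are inferred in sorted order; each code is an index into that list.
print(list(profession.cat.categories))
print(profession.cat.codes.tolist())

# Encode every column without overwriting the original strings.
codes = df.apply(lambda column: column.astype("category").cat.codes)
print(codes.corr())
```

Keeping the original strings around makes it easier to check which label each code stands for.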

Sigmon answered 9/7, 2018 at 9:12 Comment(3)

Can you provide some explanation for this answer? It works fine, it's just that I want to know why .cat.codes is used. And what does .codes do? – Edin 30/10, 2018 at 16:52

.cat.codes converts your category from a string representation into an integer representation. pandas sorts the categories alphabetically by default, so media lawyer would be replaced with 0, professor with 1 and student with 2. In the other column, online would be replaced with 0 and print with 1 – Edge 15/4, 2020 at 21:46

Does this make sense? Because if we have more than 2 string values, this can make 3 categories of 0, 1 and 2, and it might interpret that 2 is farther from 0 than 1... Not sure if I made myself clear. – Iodometry 10/12, 2022 at 19:9

As @JAgustinBarrachina pointed out, the accepted answer introduces a bias because it uses the Pearson correlation method under the hood. The categorization of each column may produce the following:

media lawyer --> 0
professor --> 1
student --> 2

Because the Pearson method computes linear correlation, it treats the codes as numbers and uses the distance between them. From the algorithm's point of view, a media lawyer is more different from a student (their distance is 2 - 0 = 2) than from a professor (1 - 0 = 1). The categories have no such order, so the resulting correlation is biased by an arbitrary labelling.
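To make the bias concrete, here is a rough sketch that computes the Pearson correlation twice on the question's five rows, changing only the order in which the professions are numbered. The helper `pearson_with_order` and both orderings are my own illustration, not part of pandas or of the accepted answer.

```python
import pandas as pd

professions = ["media lawyer", "student", "student", "professor", "media lawyer"]
media = ["print", "online", "print", "online", "online"]

def pearson_with_order(order):
    """Pearson correlation of the integer codes for one (arbitrary) profession order."""
    profession_codes = pd.Series(pd.Categorical(professions, categories=order).codes)
    media_codes = pd.Series(pd.Categorical(media, categories=["online", "print"]).codes)
    return profession_codes.corr(media_codes)

# Same data, two different numberings of the professions.
r_sorted = pearson_with_order(["media lawyer", "professor", "student"])
r_swapped = pearson_with_order(["media lawyer", "student", "professor"])
print(r_sorted, r_swapped)
```

The two printed values differ even though the underlying data is identical, so the number reflects the labelling as much as the relationship between the columns.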

According to the docs, two other correlation methods are available: Kendall and Spearman. Both assume that the categories are ordered. For example, a category such as revenue: ["low", "medium", "high"] could be considered ordered.
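When the categories really are ordered, a sketch like the following could be used. The `revenue` and `usage` values are made-up data for illustration and do not come from the question. Note that pandas relies on SciPy for these two methods when called on a Series.

```python
import pandas as pd

# Hypothetical ordered data; neither column comes from the question.
revenue = ["low", "medium", "high", "medium", "low", "high"]
usage = ["rare", "regular", "frequent", "frequent", "rare", "regular"]

# The explicit category list fixes the code order;
# ordered=True records that the order is meaningful.
revenue_codes = pd.Series(
    pd.Categorical(revenue, categories=["low", "medium", "high"], ordered=True).codes
)
usage_codes = pd.Series(
    pd.Categorical(usage, categories=["rare", "regular", "frequent"], ordered=True).codes
)

# Rank-based methods only use the order of the codes, not their spacing.
rho = revenue_codes.corr(usage_codes, method="spearman")
tau = revenue_codes.corr(usage_codes, method="kendall")
print(rho, tau)
```

Passing the category list explicitly matters here: left to the default, pandas would sort the labels alphabetically ("high", "low", "medium") and the codes would no longer follow the real order.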

If there is no order between the categories of a column, a method using Chi² and Cramér's V is more appropriate:

import numpy as np
import pandas as pd
import scipy.stats as ss
from pandas import DataFrame, Series

profession_and_media = DataFrame(data = {
    # Repeat the five rows ten times to simulate a larger, more significant sample
    "profession" : ["media lawyer" , "student" , "student" , "professor" , "media lawyer"] * 10,
    "media" : ["print" , "online" , "print" , "online" , "online"] * 10
    })

def cramers_corrected_stat(columnA: Series, columnB: Series):
    """ Calculate the Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Bergsma, Wicher (2013),
        Journal of the Korean Statistical Society 42: 323-328
    """
    confusion_matrix = pd.crosstab(columnA, columnB)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.to_numpy().sum(axis=None)
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))


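def compute_category_correlation(df: DataFrame):
    """ Compute the pairwise Cramér's V association between the string columns of a DataFrame
    """
    # Encode each column as integer codes on a copy, so the caller's DataFrame is not modified
    codes = df.apply(lambda column: column.astype('category').cat.codes)
    # pandas calls the function once per pair of columns and sets the diagonal to 1
    result = codes.corr(method=cramers_corrected_stat)
    return result.style.background_gradient(cmap='Reds')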

compute_category_correlation(profession_and_media)

Which gives

Standstill answered 13/11, 2023 at 17:17 Comment(0)
