How to run a multicollinearity test on a pandas dataframe?
I am comparatively new to Python, statistics, and the data-science libraries. My requirement is to run a multicollinearity test on a dataset with n columns and drop every column/variable whose VIF is greater than 5.

I found the following code:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=5.0):
    variables = range(X.shape[1])
    tmp = range(X[variables].shape[1])
    print(tmp)
    dropped = True
    while dropped:
        dropped = False
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped = True

    print('Remaining variables:')
    print(X.columns[variables])
    return X[variables]

But I do not clearly understand: should I pass the whole dataset in the X argument's position? If so, it is not working for me.

Please help!

Cornute answered 12/1, 2018 at 9:43 Comment(3)
When you say, "it is not working," what does that mean? What is the output? – Obedient
Might be a duplicate of stats.stackexchange.com/a/253620/19676. That answer's code works correctly for me when passing the entire dataset. – Rhyne
Hi, did you ever find an answer for this? Can you post what you ended up using/doing? – Accoutre
I tweaked the code and managed to achieve the desired result with the following, including a little bit of exception handling:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def multicollinearity_check(X, thresh=5.0):
    # The VIF computation only works on numeric columns.
    numeric_cols = X.select_dtypes(include=['int', 'int16', 'int32', 'int64',
                                            'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if numeric_cols != total_cols:
            raise Exception('All the columns should be integer or float for the multicollinearity test.')
        variables = list(range(X.shape[1]))
        dropped = True
        print('''\n\nThe VIF calculator will now iterate through the features and calculate their respective values.
        It will keep dropping the highest-VIF feature until every remaining feature has a VIF below the threshold of 5.\n\n''')
        while dropped:
            dropped = False
            vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
            print('\n\nvif is: ', vif)
            maxloc = vif.index(max(vif))
            if max(vif) > thresh:
                print('dropping \'' + X.iloc[:, variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                # Drop the offending column in place and re-index the survivors.
                X.drop(X.columns[variables[maxloc]], axis=1, inplace=True)
                variables = list(range(X.shape[1]))
                dropped = True

        print('\n\nRemaining variables:\n')
        print(X.columns[variables])
        return X
    except Exception as e:
        print('Error caught: ', e)
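As a quick sanity check, here is a small, self-contained example (the data and column names are made up for illustration) showing the kind of VIF values such a function reacts to; the near-duplicate column comes out far above the threshold of 5:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# x3 is almost an exact linear combination of x1 and x2, i.e. collinear.
X = pd.DataFrame({'x1': x1, 'x2': x2,
                  'x3': x1 + x2 + rng.normal(scale=0.01, size=200)})

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)  # all three columns involved in the collinearity get huge VIFs
```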
Cornute answered 16/2, 2018 at 11:55 Comment(0)
I also had issues running something similar. I fixed it by changing how variables was defined and finding another way of deleting its elements.

The following script should work with Anaconda 5.0.1 and Python 3.6 (the latest version as of this writing).

import numpy as np
import pandas as pd
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor    
from joblib import Parallel, delayed

# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    variables = [X.columns[i] for i in range(X.shape[1])]
    dropped=True
    while dropped:
        dropped=False
        print(len(variables))
        vif = Parallel(n_jobs=-1,verbose=5)(delayed(variance_inflation_factor)(X[variables].values, ix) for ix in range(len(variables)))

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + ' dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables.pop(maxloc)
            dropped=True

    print('Remaining variables:')
    print([variables])
    return X[variables]

X = df[feature_list] # Selecting your data

X2 = calculate_vif_(X,5) # Actually running the function

If you have many features, it will take a long time to run, so I made another change to have it work in parallel in case you have multiple CPUs available.
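To see the parallel piece in isolation, here is a minimal, self-contained sketch (random made-up data; joblib and statsmodels assumed installed). Each VIF is an independent OLS fit of one column against the rest, so the per-column calls can be fanned out with Parallel/delayed:

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
# Four independent columns, so every VIF should stay close to 1.
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=list('abcd'))

# One worker task per column; n_jobs=-1 would use all available CPUs.
vif = Parallel(n_jobs=2)(
    delayed(variance_inflation_factor)(X.values, i) for i in range(X.shape[1])
)
print(dict(zip(X.columns, vif)))
```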

Enjoy!

Czarevitch answered 2/2, 2018 at 15:44 Comment(1)
Thanks a TON @DanSan. I completed the code during that time itself, but didn't take care of parallelism. Great! – Cornute
Firstly, thanks to @DanSan for including the idea of parallelization in the multicollinearity computation. I now get at least a 50% improvement in computation time on a multi-dimensional dataset of shape (22500, 71). But I faced one interesting challenge on a dataset I was working on. It contains some categorical columns, which I binary-encoded using Category-encoders; as a result, some columns ended up with just one unique value. For such columns, the VIF is non-finite or NaN!
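A small illustration of that failure mode, with made-up encoded columns: a constant column can be caught with pandas' nunique() before any VIF is computed, which is exactly what the code below does in its first loop:

```python
import pandas as pd

df = pd.DataFrame({'enc_0': [1, 0, 1, 0, 1],
                   'enc_1': [0, 0, 0, 0, 0],  # only one unique value
                   'enc_2': [1, 1, 0, 0, 1]})

# Detect and drop columns with a single unique value before computing VIFs.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)       # ['enc_1']
print(list(df.columns))    # ['enc_0', 'enc_2']
```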

(Snapshot omitted: column value counts and the corresponding VIF values for some of the 71 binary-encoded columns.)

In these situations, the number of columns remaining after running the code by @Aakash Basu or @DanSan can depend on the order of the columns in the dataset, as I learned from bitter experience, since columns are dropped one at a time based on the maximum VIF. And a column with just one value is useless to any machine-learning model; it only forces a bias into the system!

In order to handle this issue, you can use the following updated code:

from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor

def removeMultiColl(data, vif_threshold = 5.0):
    for i in data.columns:
        if data[i].nunique() == 1:
            print(f"Dropping {i} due to just 1 unique value")
            data.drop(columns = i, inplace = True)
    drop = True
    col_list = list(data.columns)
    while drop == True:
        drop = False
        vif_list = Parallel(n_jobs = -1, verbose = 5)(delayed(variance_inflation_factor)(data[col_list].values, i) for i in range(data[col_list].shape[1]))
        max_index = vif_list.index(max(vif_list))
        if vif_list[max_index] > vif_threshold:
            print(f"Dropping column : {col_list[max_index]} at index - {max_index}")
            del col_list[max_index]
            drop = True
    print("Remaining columns :\n", list(data[col_list].columns))
    return data[col_list]

Best of luck!

Northumberland answered 10/4, 2021 at 11:58 Comment(0)
