Variance inflation factor in ridge regression in Python

I'm running a ridge regression on somewhat collinear data. One method for identifying a stable fit is a ridge trace, and thanks to the great example on scikit-learn, I'm able to do that. Another method is to calculate the variance inflation factors (VIFs) for each variable as the ridge penalty k increases; once the VIFs drop below 5, that is taken as an indication that the fit is satisfactory. Statsmodels has code for VIFs, but it is for an OLS regression. I've attempted to alter it to handle a ridge regression.
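For reference, the statsmodels helper alluded to here is variance_inflation_factor, which fits a plain OLS auxiliary regression for each column and returns 1 / (1 - R^2). A minimal sketch with made-up toy data (the column names x1-x3 are illustrative only, not from the original post):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X_demo = sm.add_constant(pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3}))

# Each call regresses column i on the remaining columns with ordinary least
# squares; there is no ridge penalty anywhere in this helper.
vifs = [variance_inflation_factor(X_demo.values, i) for i in range(1, X_demo.shape[1])]
print(dict(zip(X_demo.columns[1:], np.round(vifs, 2))))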

I'm checking my results against Regression Analysis by Example, 5th edition, chapter 10. My code generates the correct results for k = 0.000, but not after that. Working SAS code is available, but I'm not a SAS user and I don't know the differences between that implementation and scikit-learn's (and/or statsmodels's).

I've been stuck on this for a few days so any help would be much appreciated.

#http://www.ats.ucla.edu/stat/sas/examples/chp/chp_ch10.htm

from __future__ import division
import numpy as np
import pandas as pd
example = pd.read_csv('by_example_import.csv')
example.dropna(inplace=True)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(example)
# transform() returns a new array rather than modifying the frame in place,
# so the result has to be assigned back for the standardization to take effect
example = pd.DataFrame(scaler.transform(example),
                       columns=example.columns, index=example.index)

X = example.drop(['year', 'import'], axis=1)
#c_matrix = X.corr()
y = example['import']
#w, v = np.linalg.eig(c_matrix)

import pylab as pl
from sklearn import linear_model

###############################################################################
# Compute paths

alphas = [0.000, 0.001, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.018,
          0.020, 0.022, 0.024, 0.026, 0.028, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080,
          0.090, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
clf = linear_model.Ridge(fit_intercept=False)
clf2 = linear_model.Ridge(fit_intercept=False)  # used for the auxiliary regressions below
coefs = []
vif_list = [[] for _ in range(X.shape[1])]  # one list of VIFs per predictor
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

    # Auxiliary regressions: regress each predictor on the remaining ones;
    # the VIF for predictor j is 1 / (1 - R_j^2).
    for j, data in enumerate(X.columns):
        cols = [col for col in X.columns if col not in [data]]
        Z = X[cols]
        yy = X.iloc[:, j]
        clf2.set_params(alpha=a)
        clf2.fit(Z, yy)

        r_squared_j = clf2.score(Z, yy)
        vif = 1. / (1. - r_squared_j)
        print(r_squared_j)
        vif_list[j].append(vif)

# Bare expressions only display in an interactive session; assign (or print)
# the tables so they can be inspected when this is run as a script.
vif_df = pd.DataFrame(vif_list, columns=alphas).T
coef_df = pd.DataFrame(coefs, index=alphas)

###############################################################################
# Display results

ax = pl.gca()
ax.set_prop_cycle(color=['b', 'r', 'g', 'c', 'k', 'y', 'm'])  # set_color_cycle is deprecated in newer matplotlib

ax.plot(alphas, coefs)
# pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyle='dashdot')  # ridge_cv is never defined above
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
Antlion answered 14/5, 2014 at 16:22 Comment(1)
I just read an article from 1970 that mentions VIF for Ridge. The covariance matrix for the parameter estimates of a Ridge regression has a sandwich form, and I think you cannot directly use the same pattern as for OLS. If you don't get a faster answer, I should have code targeted for statsmodels within a few days. – Carney
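For context, the "sandwich form" mentioned here is the standard ridge covariance: with standardized predictors whose correlation matrix is R and penalty k,

    Var(beta_ridge) = sigma^2 * (R + k*I)^{-1} R (R + k*I)^{-1}

so the ridge VIFs are the diagonal elements of (R + k*I)^{-1} R (R + k*I)^{-1}, whereas the OLS VIFs are the diagonal of R^{-1}.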

The variance inflation factor for Ridge regression takes just three lines of code. I checked it against the example on the UCLA statistics page.

A variation of this will make it into the next statsmodels release. Here is my current function:

import numpy as np


def vif_ridge(corr_x, pen_factors, is_corr=True):
    """Variance inflation factors for Ridge regression.

    Assumes the penalization is on standardized variables.
    The data should not include a constant.

    Parameters
    ----------
    corr_x : array_like
        Correlation matrix if is_corr=True, or the original data if
        is_corr=False.
    pen_factors : iterable
        Iterable of Ridge penalization factors.
    is_corr : bool
        Boolean to indicate how corr_x is interpreted, see corr_x.

    Returns
    -------
    vif : ndarray
        Variance inflation factors for parameters in columns and ridge
        penalization factors in rows.

    Notes
    -----
    Could be optimized for repeated calculations.
    """
    corr_x = np.asarray(corr_x)
    if not is_corr:
        corr = np.corrcoef(corr_x, rowvar=False)
    else:
        corr = corr_x

    eye = np.eye(corr.shape[1])
    res = []
    for k in pen_factors:
        # sandwich form of the ridge covariance on the correlation scale:
        # (R + k*I)^-1 R (R + k*I)^-1; the VIFs are its diagonal
        minv = np.linalg.inv(corr + k * eye)
        vif = minv.dot(corr).dot(minv)
        res.append(np.diag(vif))
    return np.asarray(res)
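A possible usage sketch (not part of the original answer), assuming X is the predictor DataFrame and alphas the list of penalty factors from the question above:

# Rows of the result correspond to the penalty factors, columns to predictors.
vifs = vif_ridge(X.corr(), alphas, is_corr=True)
vif_table = pd.DataFrame(vifs, index=alphas, columns=X.columns)
print(vif_table.round(2))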
Carney answered 14/5, 2014 at 21:42 Comment(5)
This is incredibly useful. I can't thank you enough. I'm looking forward to seeing this in the next release of statsmodels! – Antlion
Did it get released in the next version of statsmodels? (Can I just run model.vif to get the VIF for each variable in the model formula?) – Withers
This implementation leads to the problematic issue of VIFs smaller than 1, as pointed out in this question on Stats SE: stats.stackexchange.com/a/514340/277115. In my answer there, I link to a publication that proposes an alternate definition of the VIFs in the case of a ridge regression; perhaps you would like to implement this instead? – Kismet
@Kismet I had looked at that article. I didn't see a problem with VIF < 1. It reflects the penalization: if we penalize strongly enough, then the parameters become just a constant equal to the prior or penalization target. It's the VIF that is left after using Ridge to work around the multicollinearity. – Carney
@Carney Thanks for your response. So according to this interpretation, a VIF less than one would be a variance "deflation" factor due to the penalisation, no? – Kismet
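To make the last exchange concrete, a toy illustration (not from the thread; it assumes vif_ridge from the answer above is in scope, and the 2x2 correlation matrix is made up) of how the ridge VIFs fall below 1 as the penalty grows:

import numpy as np

# Two highly collinear standardized predictors (made-up correlation of 0.9).
R = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# The k=0 row reproduces the OLS VIFs, 1 / (1 - 0.9**2) ~ 5.26; for larger k the
# diagonal of (R + kI)^-1 R (R + kI)^-1 eventually falls below 1 (here already
# at k=1), reflecting variance "deflation" from the penalty rather than a flaw
# in the formula.
print(vif_ridge(R, [0.0, 0.1, 1.0, 10.0]))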
