Kernel ridge and simple Ridge with Polynomial features

Asked 29/9, 2018 at 23:0 Answered 4/12, 2020 at 22:22

What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?

Superheat answered 29/9, 2018 at 23:0 Comment(0)

The difference is in how the features are computed. PolynomialFeatures explicitly computes polynomial combinations of the input features up to the desired degree, while KernelRidge(kernel='poly') only evaluates a polynomial kernel (a polynomial function of the dot products between samples), which is expressed in terms of the original features. This document provides a good overview in general.

Regarding the computation we can inspect the relevant parts from the source code:

Ridge Regression
- The actual computation starts here (for the default settings); you can compare with equation (5) in the above linked document. The computation involves computing the dot product between feature vectors (the kernel), then the dual coefficients (alpha) and finally a dot product with the feature vectors in order to obtain the weights.
Kernel Ridge
- Similarly computes the dual coefficients and stores them (instead of computing some weights). This is because when making predictions, again the kernel between training and prediction samples is computed. The result is then dotted with the dual coefficients.

The computation of the (training) kernel follows a similar procedure: compare Ridge and KernelRidge. The major difference is that Ridge explicitly takes the dot product between whatever (polynomial) features it has received, while for KernelRidge these polynomial features are generated implicitly during the computation. For example, consider a single feature x; with degree = 2 and gamma = coef0 = 1, KernelRidge computes the kernel of x with itself as (x**2 + 1)**2 == (x**4 + 2*x**2 + 1). PolynomialFeatures, on the other hand, provides the features x**2, x, 1, and the corresponding dot product is x**4 + x**2 + 1. Hence the two dot products differ by a term x**2. Of course we could rescale the poly-features to x**2, sqrt(2)*x, 1, whereas with KernelRidge(kernel='poly') we don't have this kind of flexibility. On the other hand, the difference probably doesn't matter in most cases.
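The identity described above can be checked numerically. The following is a minimal NumPy-only sketch, not the scikit-learn code path: the kernel is written out by hand as (x*z + 1)**2 (degree 2, gamma = coef0 = 1), and the sample points and variable names are chosen only for illustration.

```python
import numpy as np

# Single input feature, evaluated on a few sample points.
x = np.linspace(0.0, 2.0, 5).reshape(-1, 1)

# Polynomial kernel with degree=2, gamma=1, coef0=1: k(x, z) = (x*z + 1)**2.
K_poly = (x @ x.T + 1.0) ** 2

# Explicit features in the order [1, x, x**2].
phi = np.hstack([np.ones_like(x), x, x ** 2])
K_plain = phi @ phi.T

# Same features with the linear column rescaled by sqrt(2).
phi_scaled = phi * np.array([1.0, np.sqrt(2.0), 1.0])
K_scaled = phi_scaled @ phi_scaled.T

# The unscaled Gram matrix is missing exactly one x*z term ...
missing_term = x @ x.T
print(np.allclose(K_poly - K_plain, missing_term))
# ... and the rescaled features reproduce the kernel.
print(np.allclose(K_poly, K_scaled))
```

For z = x the missing x*z term becomes the x**2 term mentioned above.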

Note that also the computation of the dual coefficients is performed in a similar manner: Ridge and KernelRidge. Finally KernelRidge keeps the dual coefficients while Ridge directly computes the weights.
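To make the relation between dual coefficients and weights concrete, here is a small NumPy-only sketch. It is an illustration under assumptions rather than the scikit-learn implementation: the data, the value of alpha, and all variable names are made up, and the explicit features are the rescaled ones [1, sqrt(2)*x, x**2] so that both routes describe the same model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 30).reshape(-1, 1)
y = x ** 2 + 4.0 * x + rng.normal(scale=0.2, size=x.shape)

# Explicit features [1, sqrt(2)*x, x**2], the feature map of (x*z + 1)**2.
phi = np.hstack([np.ones_like(x), np.sqrt(2.0) * x, x ** 2])
alpha = 1.0  # regularization strength, chosen arbitrarily for this sketch

# Primal route: solve (Phi^T Phi + alpha*I) w = Phi^T y for the weights.
w_primal = np.linalg.solve(phi.T @ phi + alpha * np.eye(phi.shape[1]), phi.T @ y)

# Dual route: solve (K + alpha*I) c = y for the dual coefficients c.
K = (x @ x.T + 1.0) ** 2
dual_coef = np.linalg.solve(K + alpha * np.eye(K.shape[0]), y)

# The weights can be recovered from the dual coefficients as Phi^T c.
w_from_dual = phi.T @ dual_coef

# Kernel-style prediction: kernel between new and training samples, dotted with c.
x_new = np.array([[0.5], [1.5]])
pred_dual = (x_new @ x.T + 1.0) ** 2 @ dual_coef

# Weight-style prediction: explicit features of the new samples, dotted with w.
phi_new = np.hstack([np.ones_like(x_new), np.sqrt(2.0) * x_new, x_new ** 2])
pred_primal = phi_new @ w_primal

print(np.allclose(w_primal, w_from_dual), np.allclose(pred_dual, pred_primal))
```

With alpha > 0 both routes give the same model; they only differ in which quantity is stored and in the size of the linear system that is solved.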

Let's see a small example:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.extmath import safe_sparse_dot

np.random.seed(20181001)

a, b = 1, 4
x = np.linspace(0, 2, 100).reshape(-1, 1)
y = a*x**2 + b*x + np.random.normal(scale=0.2, size=(100,1))

poly = PolynomialFeatures(degree=2, include_bias=True)
xp = poly.fit_transform(x)
print('We can see that the new features are now [1, x, x**2]:')
print(f'xp.shape: {xp.shape}')
print(f'xp[-5:]:\n{xp[-5:]}', end='\n\n')
# Scale the `x` column by sqrt(2) so the explicit features match the polynomial kernel.
xp[:, 1] *= np.sqrt(2)

ridge = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge.fit(xp, y)

krr = KernelRidge(alpha=0, kernel='poly', degree=2, gamma=1, coef0=1)
krr.fit(x, y)

# Let's try to reproduce some of the involved steps for the different models.
ridge_K = safe_sparse_dot(xp, xp.T)
krr_K = krr._get_kernel(x)
print('The computed kernels are (almost) identical:')
print(f'Max. kernel difference: {np.abs(ridge_K - krr_K).max()}', end='\n\n')
print('Predictions slightly differ though:')
print(f'Max. difference: {np.abs(krr.predict(x) - ridge.predict(xp)).max()}', end='\n\n')

# Let's see if the fit changes if we provide `x**2, x, 1` instead of `x**2, sqrt(2)*x, 1`.
xp_2 = xp.copy()
xp_2[:, 1] /= np.sqrt(2)
ridge_2 = Ridge(alpha=0, fit_intercept=False, solver='cholesky')
ridge_2.fit(xp_2, y)
print('Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:')
print(f'Max. difference: {np.abs(ridge_2.predict(xp_2) - ridge.predict(xp)).max()}', end='\n\n')
print('Interpretability of the coefficients changes though:')
print(f'ridge.coef_[1:]: {ridge.coef_[0, 1:]}, ridge_2.coef_[1:]: {ridge_2.coef_[0, 1:]}')
print(f'ridge.coef_[1]*sqrt(2): {ridge.coef_[0, 1]*np.sqrt(2)}')
print(f'Compare with: a, b = ({a}, {b})')

plt.plot(x.ravel(), y.ravel(), 'o', color='skyblue', label='Data')
plt.plot(x.ravel(), ridge.predict(xp).ravel(), '-', label='Ridge', lw=3)
plt.plot(x.ravel(), krr.predict(x).ravel(), '--', label='KRR', lw=3)
plt.grid()
plt.legend()
plt.show()

From which we obtain:

We can see that the new features are now [1, x, x**2]:
xp.shape: (100, 3)
xp[-5:]:
[[1.         1.91919192 3.68329762]
 [1.         1.93939394 3.76124885]
 [1.         1.95959596 3.84001632]
 [1.         1.97979798 3.91960004]
 [1.         2.         4.        ]]

The computed kernels are (almost) identical:
Max. kernel difference: 1.0658141036401503e-14

Predictions slightly differ though:
Max. difference: 0.04244651134471766

Using features "[x**2, x, 1]" instead of "[x**2, sqrt(2)*x, 1]" predictions are (almost) the same:
Max. difference: 7.15642822779472e-14

Interpretability of the coefficients changes though:
ridge.coef_[1:]: [2.73232239 1.08868872], ridge_2.coef_[1:]: [3.86408737 1.08868872]
ridge.coef_[1]*sqrt(2): 3.86408737392841
Compare with: a, b = (1, 4)

Sponger answered 1/10, 2018 at 9:54 Comment(0)

This is an example to show it:

    from sklearn.datasets import make_friedman1
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures

    # Friedman #1 regression problem with 7 input features.
    X_F1, y_F1 = make_friedman1(n_samples=100, n_features=7, random_state=0)

    print('\nNow we transform the original input data to add\n'
          'polynomial features up to degree 2 (quadratic)\n')
    poly = PolynomialFeatures(degree=2)
    X_F1_poly = poly.fit_transform(X_F1)
    X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                        random_state=0)
    # Despite the variable name, this is a Ridge model with default settings.
    linreg = Ridge().fit(X_train, y_train)

    print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
          .format(linreg.coef_))
    print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
          .format(linreg.intercept_))
    print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
          .format(linreg.score(X_train, y_train)))
    print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
          .format(linreg.score(X_test, y_test)))

(poly deg 2 + ridge) linear model coeff (w):
[ 0.    2.23  4.73 -3.15  3.86  1.61 -0.77 -0.15 -1.75  1.6   1.37  2.52
  2.72  0.49 -1.94 -1.63  1.51  0.89  0.26  2.05 -1.93  3.62 -0.72  0.63
 -3.16  1.29  3.55  1.73  0.94 -0.51  1.7  -1.98  1.81 -0.22  2.88 -0.89]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825

Cramer answered 5/10, 2018 at 8:34 Comment(0)

I assume you already know how kernel ridge regression (KRR) and PolynomialFeatures + Ridge work. They are largely the same, so I will list some minor differences between them.

- You can switch off the bias feature in PolynomialFeatures and let Ridge fit the intercept instead. The regularization term of Ridge doesn't include the intercept. In contrast, for sklearn's KRR the penalty term always includes the bias term.
- You can scale the features generated by PolynomialFeatures before you pass them to Ridge. This is equivalent to customizing the regularization strength for each polynomial feature, so PolynomialFeatures + Ridge is a little more flexible. In contrast, the polynomial kernel gives you only two parameters to tune, gamma and coef0 (c_0); see polynomial kernel.
- The fit and prediction times differ. In KRR you need to solve the linear system K x = y, where K is N x N. With PolynomialFeatures + Ridge you only need to solve the linear system A x = y, where A is N x (D+1), N is the number of training samples, and D is the degree of the polynomial (for a single input feature).
- (This is a very rare corner case.) The kernel matrix will be (almost) singular if two samples are (nearly) identical. When alpha (the regularization strength) is also very small, you will run into numerical stability problems, because K + alpha*I is almost singular. You can only overcome this problem by using Ridge. The reason why Ridge works is explained in many machine learning textbooks.
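The size and conditioning points from the list can be illustrated with a short NumPy-only sketch. It assumes a single input feature, degree D = 2, the kernel (x*z + 1)**2 and a very small alpha; the names and values are illustrative and do not come from scikit-learn. In this setting the N x N kernel matrix has rank at most D+1 because it is the Gram matrix of D+1 explicit features, so it is singular even without duplicated samples.

```python
import numpy as np

x = np.linspace(0.0, 2.0, 50).reshape(-1, 1)

# Explicit features: N x (D+1) with D = 2.
phi = np.hstack([np.ones_like(x), np.sqrt(2.0) * x, x ** 2])

K = (x @ x.T + 1.0) ** 2  # N x N matrix of the kernel formulation
G = phi.T @ phi           # (D+1) x (D+1) matrix of the explicit formulation

# Numerical rank with an explicit absolute tolerance.
rank_K = np.linalg.matrix_rank(K, tol=1e-8)
rank_G = np.linalg.matrix_rank(G, tol=1e-8)
print(K.shape, rank_K)
print(G.shape, rank_G)

# With a tiny alpha the kernel system is far worse conditioned.
alpha = 1e-10
cond_dual = np.linalg.cond(K + alpha * np.eye(K.shape[0]))
cond_primal = np.linalg.cond(G + alpha * np.eye(G.shape[0]))
print(cond_dual, cond_primal)
```

The kernel system grows with the number of samples, while the explicit system only grows with the number of polynomial features, which is why the two approaches differ in both cost and numerical behaviour for small alpha.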

Capwell answered 4/12, 2020 at 22:22 Comment(0)
