How to add interaction term in Python sklearn

If I have independent variables [x1, x2, x3] and I fit a linear regression in sklearn, it will give me something like this:

y = a*x1 + b*x2 + c*x3 + intercept

Polynomial regression with degree=2 will give me something like

y = a*x1^2 + b*x1*x2 ......

I don't want second-degree terms like x1^2.

How can I get

y = a*x1 + b*x2 + c*x3 + d*x1*x2

if x1 and x2 have a correlation larger than some threshold value j?
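
For reference, this is roughly what I am doing now (a minimal sketch with made-up data, just to show the shapes):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# made-up data just to illustrate the shapes
X = np.random.rand(100, 3)   # columns: x1, x2, x3
y = np.random.rand(100)

# plain linear regression: y = a*x1 + b*x2 + c*x3 + intercept
lr = LinearRegression().fit(X, y)

# degree-2 polynomial features also create the squared terms I don't want:
# 1, x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
X_poly = PolynomialFeatures(degree=2).fit_transform(X)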

Scripture answered 23/8, 2017 at 0:47 Comment(0)

For generating polynomial features, I assume you are using sklearn.preprocessing.PolynomialFeatures.

There's an argument in the constructor for considering only interaction terms, so you can write something like:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

Now only the interaction terms are considered and pure powers like x1^2 are omitted. Your new feature space becomes [x1, x2, x3, x1*x2, x1*x3, x2*x3].
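
If you want to double-check which columns you got, newer scikit-learn versions can report the generated feature names (a small sketch continuing from the snippet above; get_feature_names_out requires scikit-learn >= 1.0, older versions have get_feature_names instead):

# continuing from the snippet above
print(poly.get_feature_names_out(['x1', 'x2', 'x3']))
# -> ['x1' 'x2' 'x3' 'x1 x2' 'x1 x3' 'x2 x3']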

You can fit your regression model on top of the transformed features:

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(X_inter, y)

This makes your resultant equation y = a*x1 + b*x2 + c*x3 + d*x1*x2 + e*x2*x3 + f*x3*x1.

Note: If you have a high-dimensional feature space, this would lead to the curse of dimensionality, which might cause problems like overfitting/high variance.

Bluey answered 23/8, 2017 at 12:24 Comment(2)
Nice. I would further set include_bias=False because the bias column could lead to degeneracy problems with some estimators, and LinearRegression adds its own intercept term anyway. – Roti
Really helpful answer. It looks like there's a minor typo in your resultant equation: the 4th term on the right-hand side should be d*x1*x2. – Reviere

Use patsy to construct a design matrix as follows:

from patsy import dmatrices
y, X = dmatrices('y ~ x1 + x2 + x3 + x1:x2', your_data)

where your_data is, for example, a pandas DataFrame with response column y and input columns x1, x2, and x3.

Then just call the fit method of your estimator, e.g. LinearRegression().fit(X, y).
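
A minimal end-to-end sketch of this approach, assuming your_data is a pandas DataFrame with those columns (the data here is made up; note that patsy already adds an Intercept column, so scikit-learn's own intercept is turned off):

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LinearRegression

# hypothetical data frame with the columns used in the formula
rng = np.random.default_rng(0)
your_data = pd.DataFrame(rng.random((50, 4)), columns=['y', 'x1', 'x2', 'x3'])

# 'x1:x2' adds only the interaction; X gets the columns
# Intercept, x1, x2, x3, x1:x2
y, X = dmatrices('y ~ x1 + x2 + x3 + x1:x2', your_data)

# patsy already includes the Intercept column, so disable sklearn's
model = LinearRegression(fit_intercept=False).fit(X, y)
print(X.design_info.column_names)   # ['Intercept', 'x1', 'x2', 'x3', 'x1:x2']
print(model.coef_)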

Copland answered 9/11, 2017 at 15:43 Comment(1)
love the design after lm() in R – Hindenburg

If you do y = a*x1 + b*x2 + c*x3 + intercept in scikit-learn with linear regression, I assume you do something like this:

# x = array with shape (n_samples, n_features)
# y = array with shape (n_samples)

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(x, y)

The independent variables x1, x2, x3 are the columns of feature matrix x, and the coefficients a, b, c are contained in model.coef_.

If you want an interaction term, add it to the feature matrix:

import numpy as np
x = np.c_[x, x[:, 0] * x[:, 1]]

Now the first three columns contain the variables, and the following column contains the interaction x1 * x2. After fitting the model you will find that model.coef_ contains four coefficients a, b, c, d.

Note that this will always give you a model with an interaction term (its coefficient can theoretically be 0, though), regardless of the correlation between x1 and x2. Of course, you can measure the correlation beforehand and use it to decide which model to fit.
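
For example, a small sketch of that last idea (the threshold j is just illustrative, and I'm using the plain Pearson correlation between the two columns):

import numpy as np

j = 0.5  # illustrative correlation threshold from the question

# Pearson correlation between x1 and x2 (columns 0 and 1 of x)
r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]

# add the interaction column only if the correlation exceeds the threshold
if abs(r) > j:
    x = np.c_[x, x[:, 0] * x[:, 1]]

model = LinearRegression().fit(x, y)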

Roti answered 23/8, 2017 at 8:28 Comment(0)
