Python statsmodels - quadratic term in regression

D

3

27

I have the following linear regression:

import statsmodels.formula.api as sm

model = sm.ols(formula = 'a ~ b + c', data = data).fit()

I want to add a quadratic term for b in this model.

Is there a simple way to do this with statsmodels.ols?
Is there a better package I should be using to achieve this?

Demolish answered 13/8, 2015 at 3:18 Comment(0)

W

40

Although Alexander's solution works, it is inconvenient in some situations. For example, each time you want to predict the outcome of the model for new values, you have to remember to pass both b**2 and b, which is cumbersome and should not be necessary. Although patsy does not treat "b**2" in a formula as arithmetic squaring, it does recognize numpy functions. Thus, you can use:

import statsmodels.formula.api as sm
import numpy as np

data = {"a":[2, 3, 5], "b":[2, 3, 5], "c":[2, 3, 5]}
model = sm.ols(formula = 'a ~ np.power(b, 2) + b + c', data = data).fit()

This way, you can later reuse the model without having to supply a value for b**2:

model.predict({"a":[1, 2], "b":[5, 2], "c":[2, 4]})
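As a rough check of this approach, here is a minimal sketch on synthetic data. The DataFrame `df`, the coefficients used to generate `a`, and the `smf` import alias are assumptions made for illustration and are not part of the original answer. Because the data is noiseless, the predictions should match the generating formula closely.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical noiseless data: a = 1 + 2*b + 0.5*b**2 + 3*c
rng = np.random.default_rng(0)
df = pd.DataFrame({"b": rng.uniform(-3, 3, 50), "c": rng.uniform(-3, 3, 50)})
df["a"] = 1.0 + 2.0 * df["b"] + 0.5 * df["b"] ** 2 + 3.0 * df["c"]

# np must be importable in the calling namespace so the formula can evaluate it
model = smf.ols("a ~ np.power(b, 2) + b + c", data=df).fit()

# New data only needs b and c; the squared term is rebuilt from b by the formula
new = pd.DataFrame({"b": [5.0, 2.0], "c": [2.0, 4.0]})
pred = model.predict(new)
print(pred)
```

Note that the new data does not need a column for the response `a`; only the predictors named in the formula are looked up.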

Weathercock answered 24/8, 2015 at 20:44 Comment(2)

I know I'm late to the party here, but what does the tilde ~ mean in formula = 'a ~ np.power(b, 2) + b + c'? – Mortality 9/12, 2018 at 23:41

@Mortality As in the R programming language, the ~ separates the response from the predictors: a is modeled as a linear combination of np.power(b, 2), b, and c – Moppet 25/2, 2019 at 1:53

H

51

The simplest way is

model = sm.ols(formula = 'a ~ b + c + I(b**2)', data = data).fit()

The I(...) basically says "patsy, please stop being clever here and just let Python handle everything inside kthx". (More detailed explanation)
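To see the difference that I(...) makes, one option is to build the design matrices directly with patsy and compare their columns. This is only a sketch: the three-row DataFrame `df` is made up for illustration. Without I(...), patsy reads ** as its own formula operator, so no separate squared column should appear.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data
df = pd.DataFrame({"b": [1.0, 2.0, 3.0]})

# ** is parsed as a patsy formula operator here, not as arithmetic
plain = dmatrix("b + b**2", df)

# I(...) hands the expression to Python, so b is squared element-wise
wrapped = dmatrix("b + I(b**2)", df)

print(plain.design_info.column_names)
print(wrapped.design_info.column_names)
print(np.asarray(wrapped))
```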

Hasten answered 11/4, 2016 at 3:3 Comment(1)

Doesn't work in Statsmodels 0.21.1: says "patsy.PatsyError: Error evaluating factor: ValueError: no field of name I" – Towpath 31/1, 2021 at 15:17

P

2

This should work:

data['b2'] = data.b ** 2
model = sm.ols(formula = 'a ~ b2 + b + c', data=data).fit()
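A small sketch of what this approach implies at prediction time, assuming a pandas DataFrame. The data values, the generating coefficients, and the `smf` alias are hypothetical. Since `b2` is an ordinary column, it has to be recomputed on any new data before calling predict.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical noiseless data: a = 1 + 2*b + 0.5*b**2 + 3*c
df = pd.DataFrame({"b": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                   "c": [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]})
df["a"] = 1.0 + 2.0 * df["b"] + 0.5 * df["b"] ** 2 + 3.0 * df["c"]

# Precompute the squared term as its own column
df["b2"] = df["b"] ** 2
model = smf.ols("a ~ b2 + b + c", data=df).fit()

# The same column must be added by hand to new data
new = pd.DataFrame({"b": [5.0, 2.0], "c": [2.0, 4.0]})
new["b2"] = new["b"] ** 2
pred = model.predict(new)
print(pred)
```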

Poole answered 13/8, 2015 at 3:32 Comment(2)

Do you know if this depends on a certain version? For me the b**2 term is just skipped – Demolish 13/8, 2015 at 3:38

Creating a design matrix from the formulas is done by patsy and will be independent of the statsmodels version. (I don't know how patsy treats the power operation in a formula.) – Generative 13/8, 2015 at 4:34
