What does the capital letter "I" in R linear regression formula mean?

I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues.

What does the "I" do in a model like this?

data(rock)
lm(area~I(peri - mean(peri)), data = rock)

Considering that the following does NOT work:

lm(area ~ (peri - mean(peri)), data = rock)

and that this does work:

rock$peri - mean(rock$peri)

Any key words on how to research this myself would also be very helpful.

Titicaca answered 12/6, 2014 at 19:26 Comment(13)
There is excellent documentation in R. Read help("I"). – Entropy
Yes, thanks, I saw that. That doesn't entirely answer why the special treatment is necessary inside a linear model but not outside of one. If the answer is "that's just how R works" then I suppose that counts. – Titicaca
@StephanKolassa Of course, but I got in the habit of using the more verbose command on SO because ?[ doesn't work. – Entropy
"In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators." is pretty clear. There is even a helpful link to the documentation of formula. – Entropy
@Roland, ?'[' or ?`[` – Miliary
@Miliary Sure, but a newbie doesn't know that after being shown ?I. – Entropy
I seem to have exhausted the elaboration and expansion I'm going to get. Thank you for your help. – Titicaca
To return to the original question: section 11.1 in "An Introduction to R" (ships with your R installation, look under the help menu) gives a few hints. It essentially gives the mnemonic that I() = insulate. May be helpful. And I'll agree that the documentation on I() is, um, terse. – Enright
Compare your lm code with: my.peri <- (rock$peri - mean(rock$peri)); lm(rock$area ~ my.peri); – Interpolate
Is "insulating" in addition to parentheses required because it's in a function, or more so because the classes of peri and area are different? – Titicaca
@Titicaca This is nothing to do with classes of elements and all to do with - having special meaning in a formula. The parentheses are there because I is a function so you need them just like you need them on mean(). It also (but this effect is secondary) visually indicates what is being protected from the formula parsing code. – Buoyage
It provides an additional _I_nterpretation step. – Acrid
Does this answer your question? In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3) – Ferde

I isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.

For example:

y ~ x + x^2

would, to R, mean "give me:

  1. x = the main effect of x, and
  2. x^2 = the main effect and the second order interaction of x",

not the intended x plus x-squared:

> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
           y           x
1 -1.4355144 -1.85374045
2  0.3620872 -0.07794607
3 -1.7590868  0.96856634
4 -0.3245440  0.18492596
5 -0.6515630 -1.37994358

This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
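One way to check this without fitting anything is to ask terms() which terms survive once the formula operators have been expanded. This is only a small sketch; the object names labels_plain and labels_asis are made up for illustration.

```r
# Term labels are what the formula machinery keeps after expanding operators.
labels_plain <- attr(terms(y ~ x + x^2), "term.labels")
labels_asis  <- attr(terms(y ~ x + I(x^2)), "term.labels")

print(labels_plain)  # only "x": the x^2 term collapses back to x
print(labels_asis)   # "x" and "I(x^2)": the squared term is kept as written
```

No data are needed here because terms() works on the formula symbolically.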

To get the usual operator, you need to use I() to isolate the call from the formula code:

> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
            y          x       I(x^2)
1 -0.02881534  1.0865514 1.180593....
2  0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458  0.8344739 0.696346....
5  0.65522764 -0.9676520 0.936350....

(That last column is correct; it just prints oddly because it is of class AsIs.)
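A quick way to convince yourself that the I(x^2) column really holds the squared values is to strip the class and compare. This is a sketch only; the seed and the names d, mf and squared are arbitrary choices for the illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
d  <- data.frame(x = rnorm(5), y = rnorm(5))
mf <- model.frame(y ~ x + I(x^2), data = d)

squared <- mf[["I(x^2)"]]
print(class(squared))  # "AsIs", which is why the column prints differently

# Dropping the class leaves the plain numeric values of x^2.
print(all.equal(as.numeric(squared), d$x^2))
```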

In your example, - when used in a formula would indicate removal of a term from the model, whereas you wanted - to have its usual binary operator meaning of subtraction:

> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5),  : 
  variable lengths differ (found for 'mean(x)')

This fails because mean(x) is a length 1 vector and model.frame() quite rightly tells you this doesn't match the length of the other variables. (The - removes mean(x) as a term, but model.frame() still evaluates it as a variable.) A way round this is I():

> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
           y I(x - mean(x))
1  1.1727063   1.142200....
2 -1.4798270   -0.66914....
3 -0.4303878   -0.28716....
4 -1.0516386   0.542774....
5  1.5225863   -0.72865....
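To see why the un-insulated formula y ~ x - mean(x) trips over mean(x) at all, it can help to look at what terms() records for it. The sketch below assumes nothing beyond base R; the names tt, term_labels and model_vars are illustrative.

```r
tt <- terms(y ~ x - mean(x))

# Terms that remain in the model after the formula "-" has removed mean(x).
term_labels <- attr(tt, "term.labels")

# Variables the formula mentions; model.frame() evaluates each of these.
model_vars <- sapply(as.list(attr(tt, "variables"))[-1], deparse)

print(term_labels)
print(model_vars)
```

mean(x) is not a model term, but it is still listed among the variables, so model.frame() evaluates it and then finds a length-1 result next to the length-5 columns.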

Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).

Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
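Applied to the rock data from the question, a rough sketch of the equivalence looks like this. The column name peri_c and the object names are hypothetical, chosen only for the comparison.

```r
data(rock)

fit_raw      <- lm(area ~ peri, data = rock)
fit_centered <- lm(area ~ I(peri - mean(peri)), data = rock)

# The same centering done by hand, outside the formula.
rock_c     <- transform(rock, peri_c = peri - mean(peri))
fit_manual <- lm(area ~ peri_c, data = rock_c)

# I() inside the formula matches the precomputed centered column.
print(all.equal(unname(coef(fit_centered)), unname(coef(fit_manual))))

# Centering leaves the slope alone and moves the intercept to mean(area).
print(all.equal(unname(coef(fit_centered)[2]), unname(coef(fit_raw)[2])))
print(all.equal(unname(coef(fit_centered)[1]), mean(rock$area)))
```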

Buoyage answered 12/6, 2014 at 19:43 Comment(4)
Excellent answer, I tried X:X instead of X^2 but it still did work, you know why? – Calash
What were you expecting I(X:X) to do? I assume it's going to try to apply the sequence operator, as in seq(from = X, to = X, by = 1L). But that doesn't make any sort of sense to me. – Buoyage
Well, does X:Y in a formula mean interaction term between X and Y? – Calash
Yes X:Y (not wrapped in I()) means interaction between X and Y. And this is the point; : and ^ and some other operators have different uses/interpretations within a formula. If you want the usual non-formula interpretation you need to wrap the thing in I(). I don't think X:X is going to do anything because it doesn't literally mean X * X as that doesn't work for factor variables. : means interaction. – Buoyage

From the docs:

Function I has two main uses.

  • In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.

To address this point:

df1 <- data.frame(stringi = I("dog"))
df2 <- data.frame(stringi = "dog")

str(df1)
str(df2)
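Note that since R 4.0.0 data.frame() no longer converts strings to factors by default, so on a current installation the two str() calls above differ only in the AsIs class. A small sketch that makes the protective effect visible on any version is to request the conversion explicitly; the names df_protected and df_plain are made up for this illustration.

```r
# Ask for string-to-factor conversion explicitly so the result does not
# depend on the R version's default.
df_protected <- data.frame(stringi = I("dog"), stringsAsFactors = TRUE)
df_plain     <- data.frame(stringi = "dog", stringsAsFactors = TRUE)

print(class(df_protected$stringi))  # "AsIs": still a character vector underneath
print(class(df_plain$stringi))      # "factor": the string was converted
```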

  • In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.

To address this point:

lm(mpg ~ disp + drat, mtcars)
lm(mpg ~ I(disp + drat), mtcars)

The first line fits two separate predictors, disp and drat. The second line creates a single new predictor that is the literal arithmetic sum disp + drat.
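As a rough check of that reading, the coefficient names can be compared, and the I() fit can be matched against a sum computed by hand. The column name disp_plus_drat and the object names are hypothetical.

```r
fit_two <- lm(mpg ~ disp + drat, data = mtcars)
fit_sum <- lm(mpg ~ I(disp + drat), data = mtcars)

print(names(coef(fit_two)))  # "(Intercept)" "disp" "drat"
print(names(coef(fit_sum)))  # "(Intercept)" "I(disp + drat)"

# The same single predictor, built outside the formula.
mtcars_sum <- transform(mtcars, disp_plus_drat = disp + drat)
fit_manual <- lm(mpg ~ disp_plus_drat, data = mtcars_sum)

print(all.equal(unname(coef(fit_sum)), unname(coef(fit_manual))))
```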

Quincey answered 25/9, 2018 at 10:0 Comment(0)

Thanks, everyone. I'm still confused, though. The documentation for formula says:

"The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions."

So I would think that this code:

lm(Y ~ X^2)

would give three coefficients: (1) an intercept, (2) a coefficient for the first order term, X, and (3) a coefficient for the quadratic term, X^2. It does not.

X1 <- runif(100)
X1_2 <- X1^2

set.seed(61)
Y1 <- 5*X1 + -4.5*X1^2 + rnorm(100,0,.05)

The above code gives:

> lm(Y1 ~ X1^2)

Call:
lm(formula = Y1 ~ X1^2)

Coefficients:
(Intercept)           X1  
     1.0486       0.0343  

Which is the same as having no quadratic term:

> lm(Y1 ~ X1)

Call:
lm(formula = Y1 ~ X1)

Coefficients:
(Intercept)           X1  
     1.0486       0.0343  
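A possible way to see why the two fits coincide is to expand the formulas without fitting anything. In the sketch below, the first call reproduces the (a+b+c)^2 example from the documentation quote; the data frame d and the other names are hypothetical and are not the Y1/X1 objects above.

```r
# Crossing three variables to degree 2: main effects plus pairwise interactions.
crossed <- attr(terms(~ (a + b + c)^2), "term.labels")
print(crossed)

# With a single variable there is nothing to cross it with, so X^2 is just X.
set.seed(61)  # arbitrary seed for the illustration
d <- data.frame(X = runif(100))
mm_power <- model.matrix(~ X^2, data = d)
mm_plain <- model.matrix(~ X, data = d)

print(colnames(mm_power))
print(all.equal(c(mm_power), c(mm_plain)))  # same design matrix entries
```

So in a formula ^ counts interactions among distinct variables rather than raising a variable to a power, which is consistent with the documentation once "crossing" is read that way.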

When I use the function I, I get this:

> lm(Y1 ~ I(X1^2))

Call:
lm(formula = Y1 ~ I(X1^2))

Coefficients:
(Intercept)      I(X1^2)  
     1.0602      -0.1145  

This captures only the downward curve of the data-generating process.

It looks like, to get both the linear term and the quadratic term, you have to put both explicitly in the model:

> lm(Y1 ~ X1 + I(X1^2))

Call:
lm(formula = Y1 ~ X1 + I(X1^2))

Coefficients:
(Intercept)           X1      I(X1^2)  
  -0.009738     5.021233    -4.518124  

This seems unnecessarily cumbersome and doesn't seem to be quite what the documentation promises, though I'm sure I've misunderstood it. (Probably the poly function is best for all this, anyway.)
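For what it's worth, here is a minimal sketch of the poly() route, on freshly simulated data rather than the exact Y1/X1 above (the names x, y and the fit_* objects are made up). With raw = TRUE, poly() builds the plain powers of x, so the coefficients should match the explicit x + I(x^2) fit; the default orthogonal polynomials give different coefficients but the same fitted values.

```r
set.seed(61)  # seed set before any random draw so the sketch is repeatable
x <- runif(100)
y <- 5 * x - 4.5 * x^2 + rnorm(100, 0, 0.05)

fit_explicit  <- lm(y ~ x + I(x^2))
fit_poly_raw  <- lm(y ~ poly(x, 2, raw = TRUE))
fit_poly_orth <- lm(y ~ poly(x, 2))

# Raw polynomials reproduce the explicit linear + quadratic coefficients.
print(all.equal(unname(coef(fit_explicit)), unname(coef(fit_poly_raw))))

# Orthogonal polynomials reparameterise the same model: same fitted values.
print(all.equal(fitted(fit_explicit), fitted(fit_poly_orth)))
```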

Sorry if I'm a bit obtuse; it's late, I went down a silly rabbit hole ...

Sandeesandeep answered 19/9, 2023 at 8:37 Comment(0)
