In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)

I'm trying to get my head around the use of the tilde operator, and associated functions. My 1st question is why does I() need to be used to specify arithmetic operators? For example, these 2 plots generate different results (the former having a straight line, and the latter the expected curve)

x <- c(1:100)
y <- seq(0.1,10,0.1)

plot(y~x^3)
plot(y~I(x^3))

Further, both of the following plots also generate the expected result:

plot(x^3, y)
plot(I(x^3), y)

My second question is, perhaps the examples I've been using are too simple, but I don't understand where ~ should actually be used.

Balaklava answered 8/11, 2011 at 18:42 Comment(5)
Any excellent answer to this question will draw heavily on what is contained in ?formula. - Merimerida
Duplicate; we should close one of these and make the other canonical: What does the capital letter “I” in R linear regression formula mean? - Synovia
@Synovia: You might be right. However, neither of these offered the more statistically correct use of poly until I noticed that glaring omission from my answer and included it. It's such a different slant on the general topic of formulas in R that I'm going to add a separate answer. - Decompound
@IRTFM: OK then, since we can't close older into newer, either we close that into this or at least leave a comment there referencing this question as related/duplicate. - Synovia
There's a better description of statistical issues involved in polynomial models in regression procedures by @Achim Zeileis: #30000400 - Decompound

The tilde operator is actually a function that returns an unevaluated expression, a type of language object. The expression then gets interpreted by modeling functions in a manner that is different than the interpretation of operators operating on numeric objects.
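As a small illustration of the "unevaluated language object" point, the sketch below builds a formula from symbols that are never defined (response, dose and age are made-up names, not from the question) and takes it apart:

```r
# A formula is stored unevaluated, so none of these symbols need to exist.
# (response, dose and age are hypothetical names used only for illustration.)
f <- response ~ dose + age

class(f)    # "formula"
typeof(f)   # "language"

f[[1]]      # the function being called: `~`
f[[2]]      # left-hand side:  response
f[[3]]      # right-hand side: dose + age

# The variable names mentioned in the formula, as plain strings
vars <- all.vars(f)
vars        # "response" "dose" "age"
```

Nothing is looked up until a modeling or plotting function decides how to interpret the pieces.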

The issue here is how formulas, and specifically the "+", ":", and "^" operators in them, are interpreted. (A side note: the more statistically sound procedure is to use the function poly when constructing higher-order terms in a regression formula.) Within R formulas the infix operators "+", "*", ":" and "^" have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (~) separates the left-hand side from the right-hand side. The ^ and : operators are used to construct interactions, so inside a formula x, x^2 and x^3 all denote the same term rather than the mathematical powers you might expect. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2, the R interpreter would have produced (for its own internal use) not the mathematical x^2 + 2xy + y^2 but rather the symbolic x + y + x:y, where x:y is an interaction term without its main effects. (The ^ gives you both main effects and interactions.)

?formula
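One way to see how a formula gets expanded is to ask terms() which terms it keeps. This is only a sketch: labels_of is a helper name invented here, and a, b, x and y are placeholder symbols that never get evaluated.

```r
# labels_of() is a hypothetical helper: it returns the terms R keeps
# after applying the formula meaning of the operators.
labels_of <- function(f) attr(terms(f), "term.labels")

labels_of(y ~ x^3)         # "x"            x "interacting with itself"
labels_of(y ~ I(x^3))      # "I(x^3)"       arithmetic cube, protected by I()
labels_of(y ~ (a + b)^2)   # "a" "b" "a:b"  main effects plus interaction
labels_of(y ~ a * b)       # "a" "b" "a:b"  same expansion
labels_of(y ~ a:b)         # "a:b"          interaction only
labels_of(y ~ poly(x, 3))  # "poly(x, 3)"   polynomial basis as one term
```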

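The I() function tells R to treat its argument "as is", so the operators inside it keep their ordinary arithmetic meaning instead of their formula meaning. So I(x^2) returns the vector of values raised to the second power, which is what you expected.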
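A minimal sketch of the difference, using a tiny made-up data frame d (not from the question). Outside a formula I() just tags the result with class "AsIs"; inside one it keeps ^ arithmetic:

```r
x <- 1:4

# Outside a formula: ordinary arithmetic, result tagged with class "AsIs"
cubed <- I(x^3)
class(cubed)     # "AsIs"
unclass(cubed)   # 1 8 27 64

# Inside a formula: compare the design-matrix columns
d <- data.frame(x = x)                   # hypothetical toy data
plain_cols <- colnames(model.matrix(~ x^3, data = d))
plain_cols                               # "(Intercept)" "x"

mm <- model.matrix(~ I(x^3), data = d)
colnames(mm)                             # "(Intercept)" "I(x^3)"
mm[, "I(x^3)"]                           # the cubed values
```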

The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. The ~ is an infix function in its own right. You can see that LHS ~ RHS is just the infix way of writing the function call `~`(LHS, RHS) by typing this at the console:

`~`(LHS,RHS)
#LHS ~ RHS

class( `~`(LHS,RHS) )
#[1] "formula"

identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE   # cannot use `formula` since it interprets its first argument

In regression functions, the error term in a model description takes whatever form that regression function presumes, or the form specifically requested through the family argument. The mean for the base level will generally be labelled (Intercept). The function context and arguments may also determine a link function such as log() or logit() from the family value, and it is also possible to have a non-canonical family/link combination.
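As a hedged illustration of a non-canonical family/link combination, the sketch below fits a binomial model with a probit link (the canonical link would be logit). The choice of the built-in mtcars data and of am ~ wt is mine, purely for demonstration:

```r
# Binomial family with a probit link instead of the canonical logit.
# mtcars and the am ~ wt model are assumptions made for this example only.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))

names(coef(fit))      # "(Intercept)" "wt"
family(fit)$family    # "binomial"
family(fit)$link      # "probit"
```

The formula is the same as it would be for lm(); only the family argument changes the assumed error distribution and link.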

The "+" symbol in a formula is not really adding two variables but is usually an implicit request to calculate one or more regression coefficients for that variable in the context of the rest of the variables on the RHS of the formula. The regression functions use model.matrix, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
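To make the expansion of factor levels concrete, here is a small sketch with made-up data (group and x are illustrative names). With R's default treatment contrasts, the first level becomes the baseline absorbed into (Intercept):

```r
# Hypothetical toy data: one three-level factor and one numeric variable
d <- data.frame(
  group = factor(c("a", "b", "c", "a")),
  x     = c(1.5, 2.0, 2.5, 3.0)
)

# "+" asks for a column (and hence a coefficient) per term;
# the factor is expanded into one indicator column per non-baseline level.
mm <- model.matrix(~ group + x, data = d)
colnames(mm)   # "(Intercept)" "groupb" "groupc" "x"
mm
```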

In plot()-ting functions, a formula basically reverses the usual (x, y) order of arguments that plot takes. A plot.formula method was written so that formulas could be used as a more "mathematical" mode of communicating with R. In graphics::plot.formula, curve, and the 'lattice' and 'ggplot' functions, the formula governs how multiple factors or numeric vectors are displayed and "facetted".
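The formula method for plot builds a model frame from the formula before drawing, so inspecting model.frame() directly gives a rough picture of why the two plots in the question differ. A sketch using the question's own x and y:

```r
x <- 1:100
y <- seq(0.1, 10, 0.1)

# Without I(), the formula parser reduces x^3 to plain x,
# so plot(y ~ x^3) ends up plotting y against x: a straight line.
mf_plain <- model.frame(y ~ x^3)
names(mf_plain)   # "y" "x"

# With I(), the cube is computed arithmetically,
# so plot(y ~ I(x^3)) matches plot(x^3, y): the expected curve.
mf_asis <- model.frame(y ~ I(x^3))
names(mf_asis)    # "y" "I(x^3)"
```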

The overloading of the "+" operator is discussed in the comments below. It is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method which uses "+" as an "arrangement" and grouping operator.
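One example of "+" as a grouping operator is the formula method of aggregate(). The sketch below uses the built-in ToothGrowth data, which is my choice for illustration and not something from the question:

```r
# Mean tooth length for every combination of supplement type and dose.
# Here "+" on the RHS means "group by supp and by dose", not addition.
agg <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

names(agg)   # "supp" "dose" "len"
nrow(agg)    # 6 (2 supplement types x 3 doses)
agg
```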

Decompound answered 8/11, 2011 at 18:59 Comment(4)
I had already read ?formula (although that wasn't clear from my question); what confused me there is the concept of operators in model formulation. For example, I'm totally lost as to how the + symbol can mean something other than to add two values together. - Balaklava
In a formula within a regression function you are implicitly asking to return a set of (estimated) coefficients associated with (usually multiplied by) each of the terms connected by "+" signs. - Decompound
@Balaklava The + operator is overloaded within the context of a formula. It's done to give a more intuitive feel to formula specifications. Otherwise regression calls would look like lm( formula=formula(y.var,x.var1,x.var2) ) which is less easy to understand. - Bustard
It's not just the operators that mean different things; it's more fundamentally the symbols. Usually if you type x + y, the symbols x and y are evaluated and their values summed together. In a formula context such as z ~ x + y, the symbols do not get evaluated, but the formula refers to these actual symbols. There are various operators for constructing formulas from symbols, and e.g. symbol + symbol does not mean the same thing as value + value. - Fugate
