Any pitfalls to using programmatically constructed formulas?
Asked Answered
L

1

7

I'm wanting to run through a long vector of potential explanatory variables, regressing a response variable on each in turn. Rather than paste together the model formula, I'm thinking of using reformulate(), as demonstrated here.

The function fun() below seems to do the job, fitting the desired model. Notice, though, that it records in its call element the name of the constructed formula object rather than its value.

## (1) Function using programmatically constructed formula
fun <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    lm(ff, data=mtcars)
}
fun(XX=c("cyl", "disp"))
# 
# Call:
# lm(formula = ff, data = mtcars)                 <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
# 
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars)   <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

My question: Is there any danger in this? Can this become a problem if, for instance, I want to later apply update, or predict or some other function to the model fit object, (possibly from some other environment)?

A slightly more awkward alternative that does, nevertheless, get the recorded call right is to use eval(substitute()). Is this in any way a generally safer construct?

fun2 <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    eval(substitute(lm(FF, data=mtcars), list(FF=ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)
Lehrer answered 30/6, 2013 at 23:18 Comment(4)
I think the issue will being careful using update see [#13690684. You can usually find out these pitfalls by trying to use data.table -- see [#15097311Algid
This works: do.call("lm", list(ff, quote(mtcars)))Efficient
@G.Grothendieck -- Thanks. That looks good, and seems like it shouldn't pose any problems down the road. (Also, thanks mnel for the interesting links.)Superannuation
It is also worth looking how add1.lm works (using . ~ .+ and update.formula and ti to remember (or notice) that reformulate is just a wrapper for the paste method.Algid
L
2

I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.

It turns out that the printed call is a bit of red herring.

The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)

To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.

## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>

formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp

formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>

## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))  
## mpg ~ cyl + disp

To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.

Lehrer answered 3/7, 2013 at 21:50 Comment(4)
Nice exploration. I've used the do.call method previously because of some of these same questions. It's nice to know these functions don't depend on the stored formula/call. However, I do believe that update does, so watch out for that usage.Janicejanicki
@Aaron -- Thanks. Are you referring to the same issue as mnel did in the first comment following my question? If so, am I right in understanding that that pitfall would apply equally to any of the calls above? (i.e. using do.call() to launch the call to lm() doesn't help you avoid update()'s scoping issues, right?)Superannuation
After further exploration, looks like update is safe too. I was sure it wasn't though, but maybe I've been wrong all this time. However, there are functions that do depend on the function being correct, for example, anova.lme; see https://mcmap.net/q/920420/-anova-test-fails-on-lme-fits-created-with-pasted-formula/210673.Janicejanicki
@Aaron -- Nice catch. By pointing me back to that question, I think you helped me locate and fix the bit of nlme's code that caused the OP's call to anova() to fail. (See my comment here for more details.)Superannuation

© 2022 - 2024 — McMap. All rights reserved.