Bizarre behaviour of lm() and predict.lm() depending on use of explicit namespace accessor
Asked Answered
H

2

8

I am interested in some disturbing behaviour of the lm function and the associated predict.lm function in R. The splines base package provides the function bs to generate b-spline expansions, which can then be used to fit a spline model using lm, a versatile linear model fitting function.

The lm and predict.lm functions have a lot of built-in convenience that take advantage of formulas and terms. If the call to bs() is nested inside the lm call, then the user can provide univariate data to predict, and this data will automatically be expanded into the appropriate b-spline basis. This expanded matrix of data will then predicted upon as usual.

library(splines)

x <- sort(runif(50, 0, 10))
y <- x^2

splineModel <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))

newData <- data.frame(x = 4)
prediction <- predict(splineModel, newData) # 16

plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
       pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))

As we see, this works perfectly:

enter image description here

The strangeness happens when one uses the :: operator to explicitly indicate that the bs function is exported from the namespace of the splines package. The following code snippet is identical except for that change:

library(splines)

x <- sort(runif(50, 0, 10))
y <- x^2

splineModel <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))

newData <- data.frame(x = 4) 
prediction <- predict(splineModel, newData) # 6.40171

plot(x, y)
lines(x, splineModel$fitted.values, col = 'blue3')
points(newData$x, prediction, pch = 3, cex = 3, col = 'red3')
legend("topleft", legend = c("Data", "Fitted Values", "Predicted Value"),
       pch = c(1, NA, 3), col = c('black', 'blue3', 'red3'), lty = c(NA, 1, NA))

enter image description here

The exact same results are produced in the second snippet if the splines package is never attached using library in the first place. I cannot think of another situation in which the use of the :: operator on an already-loaded package changes program behaviour.

The same behaviour arises using other functions from splines like the natural spline basis implementation ns. Interestingly, in both cases the "y hat" or fitted values are reasonable and match one another. The fitted model objects are identical except for names of attributes, as far as I can tell.

I have been unable to pin down the source of this behaviour. While this may read like a bug report, my questions are

  1. Why does this happen? I have been trying to follow through predict.lm but cannot pin down where the divergence occurs.
  2. Is this somehow intended behaviour, and if so where can I learn more about it?
Hairworm answered 19/4, 2017 at 20:1 Comment(3)
Another strange thing is that if you look at the coefficients from each model, they are the same, yet the predictions are different. By the way, you shouldn't create your data twice, because it will be different each time (unless you set the same seed each time). It doesn't make a difference here because the data are totally deterministic either way, resulting in the same model output, but better to set a seed and create the data only once.Eskill
You're right, it would have been better style to set a seed or re-use the data. But I wanted to emphasize that the second snippet is selt-contained and self-contradictory independent of the first -- there is no way the predicted value in the second plot should lie so far from the values fitted to the original data/Hairworm
And yes, the coefficients are identical, along with all numerical content in the two model object. The issue comes in somewhere in the prediction step that uses combination of the "call" and "terms" elements of the fitted model object to automatically expand the new x value into a b-spline vector.Hairworm
A
9

So the problem is that the model needs to keep track of the knots that were calculated with the original data and use those values when predicting new data. That typically happens in the model.frame() call inside the lm() call. The bs() function returns a class of "bs" and when making the model.frame, that column is dispatched to splines:::makepredictcall.bs to try to capture the boundary knots. (You can see the makepredictcall calls in the model.frame.default function.)

But if we compare the results

splineModel1 <- lm(y ~ bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel1), "predvar")
# list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots =  c(0.275912734214216, 
# 9.14309860439971), intercept = FALSE))

splineModel2 <- lm(y ~ splines::bs(x, y, degree = 3, knots = c(3, 6)))
attr(terms(splineModel2), "predvar")
# list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))

Notice how the second one doesn't capture the Boundary.knots. This is because of the splines:::makepredictcall.bs function which actually looks at the name of the call

function (var, call) {
    if (as.character(call)[1L] != "bs") 
        return(call)
    ...
}

When you use splines::bs in the formula, then as.character(call)[1L] returns "splines::bs" which does not match "bs" so nothing happens. It's unclear to me why this check is there. Seems like the method dispatching should be sufficient to assume it's a bs object.

In my opinion this does not seem like desired behavior and probably should be fixed. But the function bs() should not really be called without loading the package because functions like makepredictcall.bs don't be imported either so the custom dispatching for those objects would be broken.

Archaic answered 19/4, 2017 at 20:56 Comment(9)
Ah, excellent. It does seem strange that the "dispatching" there is done with a string comparison. Still seems odd that my particular prediction at x = 4 would be affected, considering my newly predicted point was not only within the boundary knots, but within the interior knots. The position of the Boundary knots shouldn't affect the estimation of such points.Hairworm
Well done. That said, I think it'd be pretty important to know why the code's author included that initial check before concluding that it's undesirable and should be fixed.Wanyen
@Hairworm It's not that odd -- as you dig into R's model-fitting machinery, you'll find that many many formula and terms-related operations use just that type of operation on character vectors deparsed and ultimately derived from the formula.Wanyen
@JoshO'Brien Right. I could see a user passing a bs object directly to a function so I can see why it's there. I guess the "problem" is really it doesn't also check for "splines::bs" or "splines:::bs"Archaic
Sure, but why not check the class of the object directly? Or is a string comparison necessary because all that is available at that stage is a string literal of the call? (It seems a better system would be to record the class/attributes of the object in the call separately from the string call. Then for example, one could assign B <- bs(..) and then call fit <- lm(y ~ B) and be able to use predict(fit, data.frame(x = 4))... but it's too late in the game for that.)Hairworm
@mb7744. Well, the check of the class already happened at dispatch. What it's trying to work around is the case where you do something like b<-bs(x, knots=c(3,4)); lm(y~b).Archaic
But that doesn't work, does it? By "work" I mean allow the user to provide just the vector 'x', not the expanded form.Hairworm
@Hairworm Well, i mean you can fit the model that way, but it does not make it easy to predict since there's nothing in the formula indicating where it came from. I'm sure there was some case out there that this was designed to avoid, but I don't know what it is.Archaic
@Archaic One last point related to your answer: "But the function bs() should not really be called without loading the package because functions like makepredictcall.bs don't be imported either so the custom dispatching for those objects would be broken." I believe this is inaccurate, when one calls a function using ::, the exported functions from that package are loaded, but not attached. (Thus the custom methods do work even without the package being attached.) That is why I was so surprised by the behaviour that inspired my question.Hairworm
B
1

It seems to be related to the boundary knot values in the 'predvars' attribute of the 'terms' part of splineModel.

If we call them splineModel_1 and splineModel_2

predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
6.969746

attr(splineModel_2[["terms"]], "predvars") <- attr(splineModel_1[["terms"]], "predvars")

predict(splineModel_1, newData)
16
predict(splineModel_2, newData)
16

attr(splineModel_1[["terms"]], "predvars")
list(y, bs(x, degree = 3L, knots = c(3, 6), Boundary.knots = c(0.323248628992587, 9.84225275926292), intercept = FALSE))

attr(splineModel_2[["terms"]], "predvars")
list(y, splines::bs(x, y, degree = 3, knots = c(3, 6)))

As you can see the difference is in the Boundary.knots. The only other difference is that the intercept defaults to FALSE so that's probably not relevant. The Boundary.knots are taken from the min and max of x. As for it being set by one version of bs and not another, I can only assume this is a relic in the code of lm that looks for 'bs' and not 'splines::bs' to set the Boundary.knots correctly.

Barleycorn answered 19/4, 2017 at 21:1 Comment(2)
Well spotted noticing the differing boundary knots, but with respect to "a relic in the code of lm that looks for 'bs' and not 'splines::bs' to set the Boundary.knots correctly", note that there is no other bs function in base R. Calling my first snipped without the splines library attached results in an error.Hairworm
@Hairworm Indeed. I'm speculating that there may have been a function called bs in the past (in base R or another common package) that made it a good idea to have splines::bs at the time.Barleycorn

© 2022 - 2024 — McMap. All rights reserved.