Individual terms in prediction of linear regression

I performed a regression analysis in R on a dataset and am trying to predict the contribution of each individual independent variable to the dependent variable for each row in the dataset.

So something like this:

set.seed(123)                                              
y <- rnorm(10)                                           
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)  
summary(regr)
terms <- predict.lm(regr,m, type="terms")

In short: run a regression and use the predict function to calculate the terms of v1, v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it to multiply each regression coefficient by the corresponding variable data. So something like this for v1:

coefficients(regr)[2]*m$v1

But that gives different results compared to the predict function.

My own calculation:

0.55293884  0.16253411  0.18103537  0.04999729 -0.25108302  0.80717945  0.22488764 -0.88835486  0.31681455 -0.21356803

And predict function calculation:

0.45870070  0.06829597  0.08679724 -0.04424084 -0.34532115  0.71294132  0.13064950 -0.98259299  0.22257641 -0.30780616

The predict function is off by about 0.1. Also, if you add all the terms from the predict function together with the constant, the result doesn't add up to the total prediction (using type="response"). What does the predict function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?

Finzer answered 17/12, 2017 at 9:35 Comment(2)
With type = 'response", it should be (model.matrix(~v1 + v2 + v3, m) %*% coefficients(regr))[,1] giving the same as predict.lm(regr, m, type = "response") From the terms, you can do rowSums(terms) + attr(terms, "constant")Foreboding
Does this answer your question? Prediction Components in R Linear Regression – Inquisitorial
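
A minimal sketch of the checks suggested in the comment above, assuming the objects y, m, regr and terms from the question:

# reproduce type = "response" by hand: model matrix times coefficient vector
manual_response <- (model.matrix(~ v1 + v2 + v3, m) %*% coefficients(regr))[, 1]
all.equal(manual_response, predict.lm(regr, m, type = "response"))

# the centered terms plus the stored constant also reproduce the full prediction
all.equal(rowSums(terms) + attr(terms, "constant"),
          predict.lm(regr, m, type = "response"))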

All the following lines result in the same predictions:

# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
  coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3

# prediction using predict function
predict.lm(regr,m)

# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')

You can read more about using type="terms" in the documentation for predict.lm (?predict.lm).

The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function's calculation (terms_predict[,1]) differ is that the columns of the terms matrix are centered around their means, so each column has a mean of zero:

# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)

# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
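
If you want the uncentered contributions instead (i.e. exactly coefficients(regr)[2]*m$v1 and so on), one option is to add each term's offset (the coefficient times the training-data mean of that variable) back. A minimal sketch, assuming the objects regr, m and terms_predict from above:

# undo the centering: add coef_j * mean(x_j) back to each column
offsets <- coefficients(regr)[-1] * colMeans(m[, c("v1", "v2", "v3")])
uncentered <- sweep(terms_predict, 2, offsets, FUN = "+")

# the "v1" column now matches the manual calculation from the question
all.equal(uncentered[, "v1"], coefficients(regr)[2] * m$v1,
          check.attributes = FALSE)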

Hope this helps.

Mendacious answered 17/12, 2017 at 10:12 Comment(1)
Thanks! I didn't realize the constant was also different, which is why the individual terms didn't add up to the total prediction (if you use the constant from the coefficients rather than from the terms). It's still a bit inconvenient that the terms are centred around the mean (does anyone know of a package that does it differently? I cannot find one), but I can work with this :) – Finzer

The function predict(..., type="terms") centers each term around its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (the constant, x1, and x2) is multiplied by its coefficient.

TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))


library(tidyverse)

### simulate data
set.seed(123)
nobs <- 50

x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 - 0.5*x2 + rnorm(nobs, 0, 50)

df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs, 0)
rm(x1, x2, y)

trainData <- df[train,]
testData <- df[-train,]

### linear model
mod <- lm(y ~ x1 + x2, data=trainData)
summary(mod)

### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)

### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)

all(round(apply(test_contribution, 1, sum), 5) == round(test_preds, 5))  ## should be TRUE

### visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train, "y"]

test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)

ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
  geom_line() +
  theme_bw()
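
The same one-liner applied to the objects from the original question (a sketch, assuming regr and m as defined there) gives the uncentered contribution of each variable directly:

question_contribution <- model.matrix(~ v1 + v2 + v3, m) %*% diag(coef(regr))
colnames(question_contribution) <- names(coef(regr))
# the "v1" column equals coefficients(regr)[2] * m$v1, as asked in the question
head(question_contribution)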
Inquisitorial answered 27/5, 2022 at 17:51 Comment(0)