How to Loop/Repeat a Linear Regression in R

I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.

Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.

I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.

Any help would be appreciated!

Stroman answered 14/1, 2015 at 21:19 Comment(0)

You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.

set.seed(1)

# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5 

# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender  <- factor(rbinom(obs, 1, .5))

# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))

# extract just coefficients
sapply(my_lms, coef)

# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared, 
                                adj_r_sq = x$adj.r.squared))

The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
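To get everything into one table, as the question asks, the pieces above can be combined. This is a minimal sketch that reuses the same dummy data; the names `results`, `blood_est`, `blood_p` and the commented-out file name are illustrative and not part of the original answer.

```r
set.seed(1)
n <- 5; obs <- 50
Lung <- data.frame(matrix(rnorm(obs * n), ncol = n))
Blood <- data.frame(matrix(rnorm(obs * n), ncol = n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))

# one row per gene: the Blood slope, its p-value, and R-squared
results <- do.call(rbind, lapply(seq_len(n), function(x) {
  s <- summary(lm(Lung[, x] ~ Blood[, x] + Age + Gender))
  data.frame(gene      = names(Lung)[x],
             blood_est = s$coefficients[2, "Estimate"],  # row 2 is the Blood term
             blood_p   = s$coefficients[2, "Pr(>|t|)"],
             r_squared = s$r.squared)
}))
results
# write.csv(results, "lung_blood_results.csv", row.names = FALSE)
```

Building a small data frame per model and binding the rows keeps each gene's numbers on one line, which is probably easier to export than a list of summaries.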

Dogwatch answered 14/1, 2015 at 22:53 Comment(9)
This is great! Can I ask how you're coding age and gender into the equation? The way I did it before was to make a table containing age/gender, and set AgeGender <- read.csv("AgeGender.csv", header=TRUE). I then set Age <- AgeGender[,1] and Gender <- AgeGender[,2]. Gender was also coded as 1 for male and 0 for female, so should I be changing 0.5 in your Gender function to 0? - Stroman
Strike that, it was actually much less complicated than I was making it! It seems to be working, thank you so much. - Stroman
Glad it was helpful. Feel free to 'accept' the answer! :) Remember you can get info on any function by running ?function_name at the console (e.g. ?rbinom). - Dogwatch
Just accepted it, thanks for the tip. You wouldn't happen to have any additional advice for extracting the p-values and R-squared using this function, would you? - Stroman
Take a look at ?summary.lm. - Dogwatch
To try to get R squared, I've tried using summary(lm(Lung[,x] ~ Blood[,x] + Age + Gender))$r.squared, but it keeps saying "incorrect number of dimensions". Any recommendations? - Stroman
@JHall1020 I know it's been ages, but have you figured out a way to find p-values for coefs and r-squared? - Protestantism
Hi, browsing this because I have a similar problem to OP. Why is it that you need to set the seed here? What is random? - Twice
Just so the dummy data section is reproducible (the rnorm() calls). - Dogwatch
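The thread never resolves the "incorrect number of dimensions" error. One plausible cause, offered here as an assumption rather than a diagnosis, is that `Lung` or `Blood` was a plain vector (for example, a single column that had already been extracted) rather than a data frame, so `[, x]` has no column dimension to index. A small sketch with made-up values:

```r
v <- c(1.2, 3.4, 5.6)                 # a plain vector, no dim attribute
msg <- tryCatch(v[, 1], error = function(e) conditionMessage(e))
msg                                    # "incorrect number of dimensions"

d <- data.frame(g1 = v)               # the same values as a one-column data frame
d[, 1]                                 # column indexing now works
```

Checking `dim(Lung)` and `dim(Blood)` before running the loop would show whether this is what happened.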

The question seems to be about how to call regression functions with formulas which are modified inside a loop.

Here is how you can do it in R (using the diamonds dataset from ggplot2):

library(ggplot2)  # provides the diamonds dataset
strCols <- names(diamonds)

formula <- list(); model <- list()
for (i in 1:1) {  # raise the upper bound (up to 3) to loop over more predictors
  # build the formula from column names, e.g. price ~ x
  formula[[i]] <- as.formula(paste0(strCols[7], " ~ ", strCols[7 + i]))
  model[[i]] <- glm(formula[[i]], data = diamonds)

  # then you can plot or do anything else with the result ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7 + i]))
  par(mfrow = c(2, 2))
  plot(model[[i]])
  dev.off()
}
Homogeny answered 28/3, 2017 at 17:22 Comment(0)

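Sensible or not, to make the loop work at all you need something like this:

# response (y) and predictors (x1, x2) stored in one data frame
df <- data.frame(y  = c(1, 5, 6, 2, 5, 10),
                 x1 = c(2, 12, 8, 1, 16, 17),
                 x2 = c(2, 14, 5, 1, 17, 17))
predictorlist <- list("x1", "x2")
for (i in predictorlist) {
  # paste() builds the formula string, e.g. "y ~ x1"
  model <- lm(as.formula(paste("y ~", i)), data = df)
  print(summary(model))
}

The paste() function solves the problem: it builds the formula text from each predictor name.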
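As a variation on the same idea, base R's `reformulate()` builds the formula object directly from character names, and `sapply()` collects one number per model instead of printing each summary. This is only a sketch, assuming the variables live in a data frame called `df`; `predictors` and `slopes` are illustrative names.

```r
df <- data.frame(y  = c(1, 5, 6, 2, 5, 10),
                 x1 = c(2, 12, 8, 1, 16, 17),
                 x2 = c(2, 14, 5, 1, 17, 17))
predictors <- c("x1", "x2")

# fit y ~ x1, then y ~ x2, keeping only the slope of each predictor
slopes <- sapply(predictors, function(p) {
  f <- reformulate(termlabels = p, response = "y")
  coef(lm(f, data = df))[[p]]
})
slopes
```

Because `predictors` is a character vector, `sapply()` names the result by predictor, so each slope stays labelled.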

Fruiter answered 8/1, 2019 at 12:18 Comment(1)
The vars (y/x1/x2) need to be stored in a data frame for the example to work, I think: df <- data.frame( y = c(1,5,6,2,5,10), x1 = c(2,12,8,1,16,17), x2 = c(2,14,5,1,17,17) ) - Jackleg

A tidyverse addition with map2()

Another way is to use the map2() family from the purrr package, here map2_df():

library(purrr)

xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]

map2_df(ys, xs,
        function(i, j) {
          # i is one outcome column, j the matching predictor column
          m <- lm(i ~ j + x4, data = anscombe)
          coef(m)
        })

The output is a dataframe (tibble) of all coefficients:

  `(Intercept)`     j      x4
1          4.33 0.451 -0.0987
2          6.42 0.373 -0.253 
3          2.30 0.526  0.0518

If more variables are changing, this can be done using the pmap() functions.
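For comparison, base R's `Map()` iterates over several vectors in parallel without purrr. The sketch below passes column names rather than the columns themselves, so the formula terms stay readable; keeping `x4` as a fixed covariate follows the example above, while `fits` and `coef_table` are illustrative names.

```r
xs <- c("x1", "x2", "x3")   # predictor that changes per model
ys <- c("y1", "y2", "y3")   # outcome that changes per model

# build e.g. y1 ~ x1 + x4 from the names and fit it on anscombe
fits <- Map(function(y, x) {
  lm(reformulate(c(x, "x4"), response = y), data = anscombe)
}, ys, xs)

# one row per model; columns are intercept, changing predictor, x4
coef_table <- t(sapply(fits, function(m) unname(coef(m))))
colnames(coef_table) <- c("intercept", "x", "x4")
coef_table
```

Adding a third vector of names to the `Map()` call extends this to more changing variables.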

Jackleg answered 26/7, 2021 at 16:49 Comment(0)

The following approach will work when you have multiple outcomes and predictors, fitting one model per outcome. I will use some sample data to illustrate the idea and how it works.

df <- data.frame(y1=sample(1:5, size=50, replace=TRUE),
                 y2=sample(1:5, size=50, replace=TRUE),
                 x1=sample(1:5, size=50, replace=TRUE),
                 x2=sample(1:5, size=50, replace=TRUE),
                 x3=sample(1:5, size=50, replace=TRUE),
                 x4=sample(1L:2L, size=50, replace=TRUE))
df

The function takes the dv as a character string and uses the ellipsis (...) so that any number of predictors can be supplied as bare names. Inside the function, substitute() captures the predictors unevaluated and deparse() turns them into strings, which are passed on to the reformulate() function along with the dv. I included dv=dv in the returned list so that one can see which dv the model output is associated with.

# custom lm function (uses the global data frame df defined above)
lm_func <- function(dv, ...){
  x = sapply(substitute(...()), deparse)      # predictor names as strings
  f = reformulate(termlabels=x, response=dv)  # e.g. y1 ~ x1 + x2
  model = lm(f, data=df)
  list(dv=dv, model_summary=summary(model))
}

In the next step, one selects the dvs from the target dataframe, and names them.

# select the dvs and set names
dvs <- names(df)[1:2]
dvs <- purrr::set_names(dvs)

Finally, run a loop over the dvs and store the results.

# run a for loop and save the output for each dv
lm_out = list()
for (i in seq_along(dvs)){
  lm_out[[i]] = lm_func(dvs[i], x1, x2)
}
lm_out

Note: lm_func can be extended, for example to extract only selected parts of the model summary.
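For instance, the function could return only the coefficient table as a data frame, so that the per-outcome results stack into one table. This is a sketch that takes predictor names as character strings instead of using the `substitute()` trick, to keep it short; `lm_coefs`, `dat` and `all_coefs` are hypothetical names, and the data are random.

```r
set.seed(42)
dat <- data.frame(y1 = sample(1:5, 50, replace = TRUE),
                  y2 = sample(1:5, 50, replace = TRUE),
                  x1 = sample(1:5, 50, replace = TRUE),
                  x2 = sample(1:5, 50, replace = TRUE))

lm_coefs <- function(dv, predictors, data) {
  f <- reformulate(termlabels = predictors, response = dv)
  tab <- as.data.frame(summary(lm(f, data = data))$coefficients)
  # record which outcome and which term each row belongs to
  data.frame(dv = dv, term = rownames(tab), tab,
             check.names = FALSE, row.names = NULL)
}

# one coefficient table per outcome, stacked row-wise
all_coefs <- do.call(rbind, lapply(c("y1", "y2"), lm_coefs,
                                   predictors = c("x1", "x2"), data = dat))
all_coefs
```

Passing the data frame as an argument, rather than relying on a global `df`, is a design choice made here so the function can be reused on other data.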

Digestive answered 26/2, 2023 at 7:15 Comment(0)
