Iterate over list of models and compare model fit using AIC, BIC
Asked Answered
F

4

5

I have dataset with multiple outcome variables that I would like to test against one predictor. on exploratory analysis I noted some of the relationships are polynomial to the degree 2 rather than linear. I would like to look at the BIC and AIC to make my decision of which is the best model to run.

I have lapply function where I can iterate over multiple outcome variables but now I would like to add a second model and compare their fit. However when I run this function, it only saves the second model and I dont know how to get to 'outputs' to run through the next function. Can I have this within one function or do I need two?

Here is a example from the iris dataset

data(iris)
vars <- names(iris[2:4])

models2 <- lapply(vars, function(x) {
  model_list=list(
    mod1=lm(substitute(i ~ Sepal.Length, list(i=as.name(x))), data=iris),
    mod2=lm(substitute(i ~ poly(Sepal.Length,2), list(i=as.name(x))), data=iris))
})

y <- lapply(models2, summary) #This only saves results from mod2

How do I then compare mod1 to mod2 fit and extract the following variable's?

data.frame(
  do.call(merge, list(BIC(mod1, mod2), AIC(mod1, mod2))), 
  logLik=sapply(list(mod1, mod2), logLik), 
  anova(mod1, mod2, test='Chisq'))
Funnyman answered 21/5, 2024 at 5:42 Comment(1)
If you have lm models in a list check out compare_performance() in library(performance). Does a lot of the hard work in one line.Griffey
A
3

First, make the lm nicer using do.call and reformulate. Then lapply over the models like this:

> models2 <- lapply(setNames(vars, vars), function(x) {
+   list(
+     mod1=do.call('lm', list(reformulate('Sepal.Length', x), quote(iris))),
+     mod2=do.call('lm', list(reformulate('poly(Sepal.Length, 2)', x), quote(iris)))
+   )
+ })
> 
> (res <- lapply(models2, \(x) data.frame(
+   with(x, do.call('merge', list(BIC(mod1, mod2), AIC(mod1, mod2)))),
+   logLik=with(x, sapply(list(mod1, mod2), logLik)),
+   with(x, anova(mod1, mod2))
+ )))
$Sepal.Width
  df      BIC      AIC    logLik Res.Df      RSS Df Sum.of.Sq        F     Pr..F.
1  3 188.4963 179.4644 -86.73221    148 27.91566 NA        NA       NA         NA
2  4 189.8107 177.7682 -84.88410    147 27.23618  1 0.6794752 3.667285 0.05743267

$Petal.Length
  df      BIC      AIC    logLik Res.Df      RSS Df Sum.of.Sq        F      Pr..F.
1  3 396.1669 387.1350 -190.5675    148 111.4592 NA        NA       NA          NA
2  4 389.8649 377.8223 -184.9112    147 103.3623  1  8.096848 11.51519 0.000887343

$Petal.Width
  df      BIC      AIC    logLik Res.Df      RSS Df Sum.of.Sq        F      Pr..F.
1  3 192.4030 183.3711 -88.68553    148 28.65225 NA        NA       NA          NA
2  4 182.3757 170.3331 -81.16656    147 25.91907  1  2.733179 15.50122 0.000126881

You could additionally rbind.

> do.call('rbind', res)
               df      BIC      AIC     logLik Res.Df       RSS Df Sum.of.Sq         F      Pr..F.
Sepal.Width.1   3 188.4963 179.4644  -86.73221    148  27.91566 NA        NA        NA          NA
Sepal.Width.2   4 189.8107 177.7682  -84.88410    147  27.23618  1 0.6794752  3.667285 0.057432671
Petal.Length.1  3 396.1669 387.1350 -190.56750    148 111.45916 NA        NA        NA          NA
Petal.Length.2  4 389.8649 377.8223 -184.91116    147 103.36231  1 8.0968484 11.515191 0.000887343
Petal.Width.1   3 192.4030 183.3711  -88.68553    148  28.65225 NA        NA        NA          NA
Petal.Width.2   4 182.3757 170.3331  -81.16656    147  25.91907  1 2.7331790 15.501223 0.000126881

Data:

> data(iris)
> vars <- names(iris[2:4])
Agrimony answered 21/5, 2024 at 14:59 Comment(0)
P
2

Something like this?
Compute the statistics one by one and cbind them Then pipe to do.call/rbind to put the results in one dataframe.

data(iris)
vars <- names(iris[2:4])

models2 <- lapply(vars, function(x) {
  model_list=list(
    mod1<-lm(substitute(i ~ Sepal.Length , list(i = as.name(x))), data = iris),
    mod2<-lm(substitute(i ~ poly(Sepal.Length,2), list(i = as.name(x))), data = iris))
})

results <- lapply(models2, \(m) {
  mod1 <- m[[1L]]
  mod2 <- m[[2L]]
  AIC <- AIC(mod1, mod2)
  BIC <- BIC(mod1, mod2)[2L]
  logLik <- sapply(m, logLik)
  anova <- anova(mod1, mod2, test = 'Chisq')
  cbind(AIC, BIC, logLik, anova)
}) |>
  do.call(rbind, args = _)

results
#>       df      AIC      BIC     logLik Res.Df       RSS Df Sum of Sq
#> mod1   3 179.4644 188.4963  -86.73221    148  27.91566 NA        NA
#> mod2   4 177.7682 189.8107  -84.88410    147  27.23618  1 0.6794752
#> mod11  3 387.1350 396.1669 -190.56750    148 111.45916 NA        NA
#> mod21  4 377.8223 389.8649 -184.91116    147 103.36231  1 8.0968484
#> mod12  3 183.3711 192.4030  -88.68553    148  28.65225 NA        NA
#> mod22  4 170.3331 182.3757  -81.16656    147  25.91907  1 2.7331790
#>           Pr(>Chi)
#> mod1            NA
#> mod2  5.549049e-02
#> mod11           NA
#> mod21 6.902974e-04
#> mod12           NA
#> mod22 8.245189e-05

Created on 2024-05-21 with reprex v2.1.0

Patricio answered 21/5, 2024 at 12:10 Comment(4)
What do these p-values indicate?Carinacarinate
@Carinacarinate The anovas test for the importance of a 2nd order term in the presence of an intercept and a 1st order term.Patricio
It will be great if variable names are there in place of mod...Carinacarinate
@Carinacarinate How come? AIC, BIC, log-likelihood and ANOVA are for the models, not for variables one by one.Patricio
E
2

In the question models2 is a list of lists so the summary there is trying to take the summary of lists rather than taking the summary of lm objects. Using models2 we can rewrite the code that uses it like this. Note that broom::glance produces a one row tibble with numerous statistics and by changing stats we can select which ones we wish to retain.

library(broom)
library(dplyr)
library(purrr)

stats <- c("AIC", "BIC", "logLik")
pval <- function(x, y) anova(x, y, test = "Chisq")$"Pr(>Chi)"[2]

models2 %>%
 setNames(vars) %>%
  map_dfr(with, c(m1 = glance(mod1)[stats], m2 = glance(mod2)[stats],
    pval = pval(mod1, mod2)), .id = "response") 

giving

# A tibble: 3 × 8
  response     m1.AIC m1.BIC m1.logLik m2.AIC m2.BIC m2.logLik      pval
  <chr>         <dbl>  <dbl>     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
1 Sepal.Width    179.   188.     -86.7   178.   190.     -84.9 0.0555   
2 Petal.Length   387.   396.    -191.    378.   390.    -185.  0.000690 
3 Petal.Width    183.   192.     -88.7   170.   182.     -81.2 0.0000825

These are the statistics that glance produces. Here statistic means F statistic.

names(glance(lm(demand ~., BOD)))
##  [1] "r.squared"     "adj.r.squared" "sigma"         "statistic"    
##  [5] "p.value"       "df"            "logLik"        "AIC"          
##  [9] "BIC"           "deviance"      "df.residual"   "nobs"        

Note that models2 in the question is fine but it could be written like this. We first define fm which is mod1 and then simply update it to get mod2.

models2 <- lapply(vars, function(x) {
  fm <- do.call("lm", list(reformulate("Sepal.Length", x), quote(iris)))
  list(mod1 = fm,
       mod2 = update(fm, . ~ poly(Sepal.Length, 2)))
})
Elburt answered 21/5, 2024 at 18:53 Comment(1)
Thanks for the suggestion of the models2 code this is a lot cleaner code. Your code for getting the summary of AIC etc.. worked well as well.Funnyman
K
1

Your question is really about how to iterate over lists.

summaries <- sapply(models2, lapply, summary)
colnames(summaries) <- vars
rownames(summaries) <- c("degree = 1", "degree = 2")
#example:
summaries["degree = 2", "Petal.Length"]
#shows the summary

lapply(models2, \(x) {
  res <- AIC(x[[1]], x[[2]])
  rownames(res) <- c("degree = 1", "degree = 2")
  res
  })

#[[1]]
#           df      AIC
#degree = 1  3 179.4644
#degree = 2  4 177.7682
#
#[[2]]
#           df      AIC
#degree = 1  3 387.1350
#degree = 2  4 377.8223
#
#[[3]]
#           df      AIC
#degree = 1  3 183.3711
#degree = 2  4 170.3331
Khajeh answered 21/5, 2024 at 12:1 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.