R bootstrap regression with facet_wrap
Asked Answered
T

3

10

Been practicing with the mtcars dataset.

I created this graph with a linear model.

library(tidyverse)
library(tidymodels)

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + geom_smooth(method = 'lm')

enter image description here

Then I converted the dataframe to a long dataframe so I could try a facet_wrap.

mtcars_long_numeric <- mtcars %>%
  select(mpg, disp, hp, drat, wt, qsec) 

mtcars_long_numeric <- pivot_longer(mtcars_long_numeric, names_to = 'names', values_to = 'values', 2:6)

enter image description here

Now I wanted to learn something about the standard error on the geom_smooth and to see if I could generate a confidence interval using bootstrapping.

I found this code in the RStudio tidy models documentation at this link.

boots <- bootstraps(mtcars, times = 250, apparent = TRUE)
boots

fit_nls_on_bootstrap <- function(split) {
    lm(mpg ~ wt, analysis(split))
}

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals

ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  labs(title="", subtitle = "mpg ~ wt - Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "blue") +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "blue")

boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

ggplot(boot_aug, aes(wt, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 

mtcars %>%
group_by(cyl) %>%
summarise(x = quantile(mpg, c(0.25, 0.75)), y = IQR(mpg)) %>%
filter(cyl == 8) %>%
mutate(z = x - y)

Is there some way to have the bootstrap regression as a facet_wrap also? I tried putting the long dataframe into the bootstraps function. .

boots <- bootstraps(mtcars_long_numeric, times = 250, apparent = TRUE)
boots

fit_nls_on_bootstrap <- function(split) {
    group_by(names) %>%
    lm(mpg ~ values, analysis(split))
}

But this doesn't work.

Or else I tried adding a group_by here:

boot_models <-
  boots %>% 
  group_by(names) %>%
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

This doesn't work because boots$names doesn't exist. I can't add a grouping as a facet_wrap in boot_aug either because names doesn't exist there.

ggplot(boot_aug, aes(values, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
    facet_wrap(~names) +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 

Also, I've learned that I can't facet_wrap by ~id, either. I end up with a graph that looks like this and it's pretty unreadable! I'm really interested in using facet_wrap on things like 'wt', 'disp', 'qsec' and not on each bootstrap itself.

enter image description here

This is one of those cases where I'm using code a little above my weight - would appreciate any advice.

This is the image that I would like to have as expected output. Except instead of the shaded area for the standard error, I would like to see bootstrapped regression models that take up the same area, more or less.

enter image description here

Tarpley answered 13/7, 2021 at 0:6 Comment(4)
Can you share an image of what you want your output to look like? Thanks!Boley
I added a final short paragraph and an image. Hope this clarifies. I'm trying to facet_wrap linear models and, instead of leaning on the standard error with the shaded area in the geom_smooth(se = TRUE) part of ggplot, I would like the area to be filled with the results of bootstrapped regressions. You can see in my post that the first graph becomes one of the sections of the second graph. I'm hoping the third graph can become one of the sections of a fourth graph I'm not able to create!Tarpley
Is your question specifically about using facet_wrap(), or would something like ggarrange() work? The latter would be simpler to implement, but much less compact and integrated. I can post an example using of a non-facet approach as an answer if you'd like.Boley
I appreciate the idea. I'm really trying to work through tidyverse and tidymodel methods to understand them, so I would prefer that structure as a solution although maybe it's not exactly possible here :/Tarpley
L
5

Perhaps something like this:

library(data.table)
mt = as.data.table(mtcars_long_numeric)

# helper function to return lm coefficients as a list
lm_coeffs = function(x, y) {
  coeffs = as.list(coefficients(lm(y~x)))
  names(coeffs) = c('C', "m")
  coeffs
}

# generate bootstrap samples of slope ('m') and intercept ('C')
nboot = 100
mtboot = lapply (seq_len(nboot), function(i) 
  mt[sample(.N,.N,TRUE), lm_coeffs(values, mpg), by=names])
mtboot = rbindlist(mtboot)

# and plot:    
ggplot(mt, aes(values, mpg)) +
  geom_abline(aes(intercept=C, slope=m), data = mtboot, size=0.3, alpha=0.3, color='forestgreen') +
  stat_smooth(method = "lm", colour = "red", geom = "ribbon", fill = NA, size=0.5, linetype='dashed') +
  geom_point() +
  facet_wrap(~names, scales = 'free_x')

enter image description here

P.S for those who prefer dplyr (not me), here's the same logic converted to that format:

lm_coeffs = function(x, y) {
  coeffs = coefficients(lm(y~x))
  tibble(C = coeffs[1], m=coeffs[2])
}

mtboot = lapply (seq_len(nboot), function(i) 
  mtcars_long_numeric %>%
    group_by(names) %>%
    slice_sample(prop=1, replace=TRUE) %>% 
    summarise(tibble(lm_coeffs2(values, mpg))))
mtboot = do.call(rbind, mtboot)
Latterday answered 21/7, 2021 at 17:24 Comment(2)
oh wow what is this part doing? mt[sample(.N,.N,TRUE), lm_coeffs(values, mpg), by=names][,i:=I]) what is ".N" here? I'm having a hard time seeing what that is doingTarpley
.N is the number of rows in the group. It is the data.table equivalent of n() in dplyr, This line takes a .N length sample of these rows with replacement - which is the definition of how bootstrapping works.Latterday
S
5

Something like this might work if you want to stick with the tidyverse:

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(broom)

set.seed(42)

n_boot <- 1000

mtcars %>% 
  select(-c(cyl, vs:carb)) %>% 
  pivot_longer(-mpg) -> tbl_mtcars_long

tbl_mtcars_long %>% 
  nest(model_data = c(mpg, value)) %>% 
  # for mpg and value observations within each level of name (e.g., disp, hp, ...)
  mutate(plot_data = map(model_data, ~ {
    # calculate information about the observed mpg and value observations
    # within each level of name to be used in each bootstrap sample
    submodel_data <- .x
    n <- nrow(submodel_data)
    min_x <- min(submodel_data$value)
    max_x <- max(submodel_data$value)
    pred_x <- seq(min_x, max_x, length.out = 100)
    
    # do the bootstrapping by
    # 1) repeatedly sampling samples of size n with replacement n_boot times,
    # 2) for each bootstrap sample, fit a model, 
    # 3) and make a tibble of predictions
    # the _dfr means to stack each tibble of predictions on top of one another
    map_dfr(1:n_boot, ~ {
      submodel_data %>% 
        sample_n(n, TRUE) %>% 
        lm(mpg ~ value, .) %>% 
        # suppress augment() warnings about dropping columns
        { suppressWarnings(augment(., newdata = tibble(value = pred_x))) }
    }) %>% 
      # the bootstrapping is finished at this point
      # now work across bootstrap samples at each value
      group_by(value) %>% 
      # to estimate the lower and upper 95% quantiles of predicted mpgs
      summarize(l = quantile(.fitted, .025),
                u = quantile(.fitted, .975),
                .groups = "drop"
      ) %>% 
      arrange(value)
  })) %>% 
  select(-model_data) %>% 
  unnest(plot_data) -> tbl_plot_data

ggplot() + 
  # observed points, model, and se
  geom_point(aes(value, mpg), tbl_mtcars_long) + 
  geom_smooth(aes(value, mpg), tbl_mtcars_long, 
              method = "lm", formula = "y ~ x", alpha = 0.25, fill = "red") + 
  # overlay bootstrapped se for direct comparison
  geom_ribbon(aes(value, ymin = l, ymax = u), tbl_plot_data, 
              alpha = 0.25, fill = "blue") + 
  facet_wrap(~ name, scales = "free_x")

Created on 2021-07-19 by the reprex package (v1.0.0)

Spathose answered 19/7, 2021 at 21:14 Comment(2)
Thanks for the answer on this. I think the other answer compiled a little faster and was somehow a little easier for me to understand, and also graphed in a bit more clear a way all the different bootstrap regressions. But I think this code was really impressive and got the job doneTarpley
Yes, I was impressed with the compactness of the @Latterday answer. The code can be considerably shortened by only plotting the bootstrap models and not estimating the CIs.Spathose
L
5

Perhaps something like this:

library(data.table)
mt = as.data.table(mtcars_long_numeric)

# helper function to return lm coefficients as a list
lm_coeffs = function(x, y) {
  coeffs = as.list(coefficients(lm(y~x)))
  names(coeffs) = c('C', "m")
  coeffs
}

# generate bootstrap samples of slope ('m') and intercept ('C')
nboot = 100
mtboot = lapply (seq_len(nboot), function(i) 
  mt[sample(.N,.N,TRUE), lm_coeffs(values, mpg), by=names])
mtboot = rbindlist(mtboot)

# and plot:    
ggplot(mt, aes(values, mpg)) +
  geom_abline(aes(intercept=C, slope=m), data = mtboot, size=0.3, alpha=0.3, color='forestgreen') +
  stat_smooth(method = "lm", colour = "red", geom = "ribbon", fill = NA, size=0.5, linetype='dashed') +
  geom_point() +
  facet_wrap(~names, scales = 'free_x')

enter image description here

P.S for those who prefer dplyr (not me), here's the same logic converted to that format:

lm_coeffs = function(x, y) {
  coeffs = coefficients(lm(y~x))
  tibble(C = coeffs[1], m=coeffs[2])
}

mtboot = lapply (seq_len(nboot), function(i) 
  mtcars_long_numeric %>%
    group_by(names) %>%
    slice_sample(prop=1, replace=TRUE) %>% 
    summarise(tibble(lm_coeffs2(values, mpg))))
mtboot = do.call(rbind, mtboot)
Latterday answered 21/7, 2021 at 17:24 Comment(2)
oh wow what is this part doing? mt[sample(.N,.N,TRUE), lm_coeffs(values, mpg), by=names][,i:=I]) what is ".N" here? I'm having a hard time seeing what that is doingTarpley
.N is the number of rows in the group. It is the data.table equivalent of n() in dplyr, This line takes a .N length sample of these rows with replacement - which is the definition of how bootstrapping works.Latterday
T
2

I think I figured out an answer to this problem. I would still like help figuring out how to make this code much tighter - you can see I've duplicated a lot.

mtcars_mpg_wt <- mtcars %>%
  select(mpg, wt)

boots <- bootstraps(mtcars_mpg_wt, times = 250, apparent = TRUE)
boots

# glimpse(boots) 
# dim(mtcars)

fit_nls_on_bootstrap <- function(split) {
    lm(mpg ~ wt, analysis(split))
}

library(purrr)

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals

ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  labs(title="", subtitle = "mpg ~ wt - Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "blue") +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "blue")




boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

# boot_aug <- 
#   boot_models %>% 
#   sample_n(200) %>% 
#   mutate(augmented = map(model, augment)) %>% 
#   unnest(augmented)

# boot_aug
glimpse(boot_aug)


ggplot(boot_aug, aes(wt, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 


boot_aug_1 <- boot_aug %>%
  mutate(factor = "wt")






mtcars_mpg_disp <- mtcars %>%
  select(mpg, disp)

boots <- bootstraps(mtcars_mpg_disp, times = 250, apparent = TRUE)
boots

# glimpse(boots) 
# dim(mtcars)

fit_nls_on_bootstrap <- function(split) {
    lm(mpg ~ disp, analysis(split))
}

library(purrr)

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals

ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  labs(title="", subtitle = "mpg ~ wt - Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "blue") +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "blue")




boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

# boot_aug <- 
#   boot_models %>% 
#   sample_n(200) %>% 
#   mutate(augmented = map(model, augment)) %>% 
#   unnest(augmented)

# boot_aug
glimpse(boot_aug)


ggplot(boot_aug, aes(disp, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ disp \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 



boot_aug_2 <- boot_aug %>%
  mutate(factor = "disp")




mtcars_mpg_drat <- mtcars %>%
  select(mpg, drat)

boots <- bootstraps(mtcars_mpg_drat, times = 250, apparent = TRUE)
boots

# glimpse(boots) 
# dim(mtcars)

fit_nls_on_bootstrap <- function(split) {
    lm(mpg ~ drat, analysis(split))
}

library(purrr)

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals

ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  labs(title="", subtitle = "mpg ~ wt - Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "blue") +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "blue")




boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

# boot_aug <- 
#   boot_models %>% 
#   sample_n(200) %>% 
#   mutate(augmented = map(model, augment)) %>% 
#   unnest(augmented)

# boot_aug
glimpse(boot_aug)


ggplot(boot_aug, aes(drat, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 


boot_aug_3 <- boot_aug %>%
  mutate(factor = "drat")








mtcars_mpg_hp <- mtcars %>%
  select(mpg, hp)

boots <- bootstraps(mtcars_mpg_hp, times = 250, apparent = TRUE)
boots

# glimpse(boots) 
# dim(mtcars)

fit_nls_on_bootstrap <- function(split) {
    lm(mpg ~ hp, analysis(split))
}

library(purrr)

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals

ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  labs(title="", subtitle = "mpg ~ wt - Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "blue") +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "blue")




boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

# boot_aug <- 
#   boot_models %>% 
#   sample_n(200) %>% 
#   mutate(augmented = map(model, augment)) %>% 
#   unnest(augmented)

# boot_aug
glimpse(boot_aug)


ggplot(boot_aug, aes(hp, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 


boot_aug_4 <- boot_aug %>%
  mutate(factor = "hp")











mtcars_mpg_qsec <- mtcars %>%
  select(mpg, qsec)

boots <- bootstraps(mtcars_mpg_qsec, times = 250, apparent = TRUE)
boots

# glimpse(boots) 
# dim(mtcars)

fit_nls_on_bootstrap <- function(split) {
    lm(mpg ~ qsec, analysis(split))
}

library(purrr)

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals

ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  labs(title="", subtitle = "mpg ~ wt - Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "blue") +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "blue")




boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

# boot_aug <- 
#   boot_models %>% 
#   sample_n(200) %>% 
#   mutate(augmented = map(model, augment)) %>% 
#   unnest(augmented)

# boot_aug
glimpse(boot_aug)


ggplot(boot_aug, aes(qsec, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") 


boot_aug_5 <- boot_aug %>%
  mutate(factor = "qsec")


boot_aug_total <- bind_rows(boot_aug_1, boot_aug_2, boot_aug_3, boot_aug_4, boot_aug_5)

boot_aug_total <- boot_aug_total %>%
  select(disp, drat, hp, qsec, wt, mpg, .fitted, id, factor)

boot_aug_total_2 <- pivot_longer(boot_aug_total, names_to = 'names', values_to = 'values', 1:5)

boot_aug_total_2 <- boot_aug_total_2 %>%
  drop_na()
  


ggplot(boot_aug_total_2, aes(values, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "blue") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    labs(title="", subtitle = " \n Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "coefficient stability testing") +
  facet_wrap(~factor, scales = 'free')

enter image description here

vs

enter image description here

Tarpley answered 16/7, 2021 at 19:13 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.