Block bootstrap from subject list

I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.

I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.

  • Each firm can potentially be selected multiple times, so I need to include its data multiple times in each iteration's data set.
  • Using a loop-and-subset approach, like the one below, seems computationally burdensome.
  • Note that my real data frame, n, and the number of iterations are all much larger than in the example below.

My initial thought is to break the existing data frame into a list by subject using the split() command. From there, use

sample(unique(df1$subject),n,replace=TRUE)

to draw the new sample of subjects, and then perhaps use quickdf from the plyr package to construct the new data frame.
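For concreteness, here is a rough sketch of that idea, reusing the df1/subject/n placeholders from above and substituting a plain do.call(rbind, ...) for quickdf:

bysub   <- split(df1, df1$subject)                 # one data frame per subject, built once
picks   <- sample(names(bysub), n, replace = TRUE) # sampled subjects, repeats allowed
newdata <- do.call(rbind, bysub[picks])            # repeated subjects are stacked again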

Example slow code:

require(plm)
data("Grunfeld", package="plm")

firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()

for(j in 1:iterations){

  # sample n firm identifiers, with replacement
  v = sample(firms, n, replace = TRUE)
  newdata = NULL

  for(i in 1:n){
    newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
  }

  reg1 = lm(value ~ inv + capital, data = newdata)
  mybootresults[[j]] = coefficients(reg1)

}

mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults

  (Intercept)      inv    capital
1    373.8591 6.981309 -0.9801547
2    370.6743 6.633642 -1.4526338
3    528.8436 6.960226 -1.1597901
4    331.6979 6.239426 -1.0349230
5    507.7339 8.924227 -2.8661479
...
...
Discretionary answered 12/8, 2012 at 4:50 Comment(4)
Have you looked at the boot package? – Display
Yes, I have looked at and used the boot package. I don't think it has the ability to do this type of block bootstrap, however. – Discretionary
It's a bit of a fudge and probably overkill using boot, but I think the answer I posted does the job. – Display
Note also that boot allows you to easily run regressions on multiple cores, which can provide significant speed improvements for large numbers of replicates. – Siren
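For reference, a minimal sketch of that parallel option, using the myfit statistic defined in the accepted answer below (parallel and ncpus are arguments of boot(); ncpus = 4 is just an illustrative value):

library(boot)
## "multicore" forks, so it is Unix-alike only; use parallel = "snow" on Windows
b0 <- boot(firms, myfit, R = 999, parallel = "multicore", ncpus = 4)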

How about something like this:

library(boot)

myfit <- function(x, i) {
   # x is the vector of firms; i holds the bootstrap-sampled indexes into x
   mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm == x[n])))
   coefficients(lm(value ~ inv + capital, data = mydata))
}

firms <- unique(Grunfeld$firm)

b0 <- boot(firms, myfit, 999)
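A quick usage sketch for the resulting object (boot.ci and the t component are standard parts of the boot package; index = 2 picks the inv coefficient):

b0                                     # original coefficients, bias, std. error
head(b0$t)                             # matrix of bootstrap replicates, one row per replicate
boot.ci(b0, type = "perc", index = 2)  # percentile CI for the inv coefficient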
Display answered 12/8, 2012 at 5:47 Comment(3)
This appears to do exactly what I want -- thanks, and extra credit for utilizing the boot package. I have not used do.call before, so this is helpful for that reason as well. Cheers! – Discretionary
Do you know offhand how to deal with fixed effects (i.e. factor(region)) within lm() in the boot paradigm? Since some factors might not show up in each iteration, the number of coefficients differs and it errors out. Thoughts? – Discretionary
Thanks, this was really helpful. Did you sort out the fixed-effects issue, by any chance? For this same reason, I'm having trouble comparing my target statistic across repetitions. My current strategy is to omit all FE coefficients in the vector I return in the 'myfit' function, but that seems a bit amateurish. – Ona

You can also use the tsboot function in the boot package with a fixed-block resampling scheme.

require(plm)
require(boot)
data(Grunfeld)

### each firm contributes 20 observations
table(Grunfeld$firm)
##  1  2  3  4  5  6  7  8  9 10 
## 20 20 20 20 20 20 20 20 20 20


blockboot <- function(data) {
  coefficients(lm(value ~ inv + capital, data = data))
}

### fixed-length block bootstrap: blocks of 20 obs, i.e. one block per firm
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")

boot.1    
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 410.81557 -25.785972    174.3766
## t2*   5.75981   0.451810      2.0261
## t3*  -0.61527   0.065322      0.6330

dim(boot.1$t)
## [1] 99  3

head(boot.1$t)
##        [,1]   [,2]      [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283  0.031324
## [3,] 479.74 3.2531  0.637298
## [4,] 557.79 4.5284  0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
Anglomania answered 12/8, 2012 at 18:35 Comment(2)
l=20 in your tsboot() statement will inevitably take a block with, say, 15 observations of one firm and 5 observations of another (=20), which is not what I want to do. I either want to include all of a firm's observations or none, sometimes including the same firm's observations multiple times. – Discretionary
@Discretionary No, because the first step of the algorithm is to split the dataset (of 200 rows) into 10 blocks of 20 rows (y_1, y_2, ..., y_10) and then resample the dataset by block, e.g. (y_1, y_10, y_3, ..., y_1), (y_2, y_9, ..., y_3) and so on. And because of the organisation of the Grunfeld dataset, each y_* corresponds to one firm (every 20 rows). For more info, look at Bootstrap Methods and their Application (page 396). – Anglomania
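To make the block/firm alignment concrete, a quick check (a sketch; block.id is just a helper, and this relies on Grunfeld being ordered by firm and then year):

block.id <- rep(1:10, each = 20)   # label the 10 consecutive blocks of 20 rows
tapply(Grunfeld$firm, block.id, function(f) length(unique(f)))
## each block should contain exactly one firm, i.e. every entry equals 1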

Here is a method that should typically be faster than the accepted answer, returns the same results, and does not rely on additional packages (except boot). The key is to use which() and integer indexing to construct each replicate data frame, rather than split()/subset() and do.call("rbind", ...).

# statistic function for boot()
myIndex <- function(x, i) {
  # row numbers of the sampled firms' observations (repeats are expected)
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # run the regression for this replicate and return the estimated coefficients
  coefficients(lm(value ~ inv + capital, data = Grunfeld[blockObs, ]))
}

Now, bootstrap:

# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)

Run the accepted answer

set.seed(1234)
b0 <- boot(firms, myfit, 200)

Let's eyeball a comparison

Using indexing:

b1

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = firms, statistic = myIndex, R = 200)


Bootstrap Statistics :
       original      bias    std. error
t1* 410.8155650 -6.64885086 197.3147581
t2*   5.7598070  0.37922066   2.4966872
t3*  -0.6152727 -0.04468225   0.8351341

Original version

b0

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = firms, statistic = myfit, R = 200)


Bootstrap Statistics :
       original      bias    std. error
t1* 410.8155650 -6.64885086 197.3147581
t2*   5.7598070  0.37922066   2.4966872
t3*  -0.6152727 -0.04468225   0.8351341

These look pretty close. Now, a bit more checking

identical(b0$t, b1$t)
[1] TRUE

and

identical(summary(b0), summary(b1))
[1] TRUE

Finally, we'll do a quick benchmark

library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)}, 
              rbind={b0 <- boot(firms, myfit, 200)})

On my computer, this returns

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866   100
 rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759   100

So, direct indexing is more than 2 times faster at every level of the distribution.

A note on missing fixed effects
As with most of the answers here, the issue of missing fixed effects may emerge. Commonly, fixed effects are used as controls and the researcher is really interested in one or a couple of variables that will be included with every selected observation. In this dominant case, there is no (or very little) harm in restricting the result returned by the myIndex or myfit function to only the variables of interest, as sketched below.
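For instance, a minimal sketch of that restriction (myIndexFE is a hypothetical name; it reuses the indexing from myIndex above and simply drops the fixed-effect dummies from the returned vector):

myIndexFE <- function(x, i) {
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # fit with firm fixed effects, but return only the slopes of interest so the
  # returned vector has the same length in every replicate
  fit <- lm(value ~ inv + capital + factor(firm), data = Grunfeld[blockObs, ])
  coefficients(fit)[c("inv", "capital")]
}
bFE <- boot(firms, myIndexFE, 200)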

Siren answered 21/3, 2018 at 20:44 Comment(0)

The solution needs to be modified to manage fixed effects.

library(boot)  # for boot
library(plm)   # for Grunfeld
library(dplyr) # for left_join

## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")

## Create dataframe with a unique firm identifier (one line per firm);
## the junk column keeps X[i,] a data frame rather than dropping it to a vector
firms <- data.frame(firm=unique(Grunfeld$firm), junk=1)

## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
    ## join the sampled firms to their firm-year data
    mydata <- left_join(X[i,], Grunfeld, by="firm")
    ## Distinguish between multiple resamples of the same firm
    ## Otherwise they have the same id in the fixed effects regression
    ## And trouble ensues
    mydata  <- mutate(group_by(mydata,firm,year),
                      firm_uniq4boot = paste(firm,"+",row_number())
                      )
    ## Run regression with and without firm fixed effects
    c(coefficients(lm(value ~ inv + capital, data = mydata)),
    coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
    }

set.seed(1)
system.time(b <- boot(firms, myfit, 1000))

summary(b)

summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
Casuist answered 26/4, 2017 at 23:10 Comment(1)
Note that firm_uniq4boot takes a unique value for each firm+i; this example includes a separate intercept for each firm_uniq4boot but a single slope coefficient across all of them. – Laundes

I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as the accepted answer. Here's a complete self-contained example.

library(boot)  # for boot
library(plm)   # for Grunfeld
library(dplyr) # for left_join

# First get the data
data("Grunfeld", package="plm")

firms <- unique(Grunfeld$firm)

myfit1 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
  coefficients(lm(value ~ inv + capital, data = mydata))
}

myfit2 <- function(x, i) {
  # x is the vector of firms
  # i are the indexes into x
  mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
  coefficients(lm(value ~ inv + capital, data = mydata))
}

# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
  ##  user  system elapsed 
  ## 13.51    0.01   13.62 

# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
   ## user  system elapsed 
   ## 8.16    0.02    8.26 

b1
##        original     bias    std. error
## t1* 410.8155650  9.2896499 198.6877889
## t2*   5.7598070  0.5748503   2.5725441
## t3*  -0.6152727 -0.1200954   0.7829191

b2
##        original     bias    std. error
## t1* 410.8155650  9.2896499 198.6877889
## t2*   5.7598070  0.5748503   2.5725441
## t3*  -0.6152727 -0.1200954   0.7829191
Laundes answered 30/11, 2016 at 1:46 Comment(0)
