R: Remove nested for loops in order to make a custom bootstrap more efficient

I am trying to gather some bootstrapped estimates for summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to lean on nested for loops.

Specifically, suppose there are two groups in my dataset, and each group is further divided into test and control. Group 1 has a 75% / 25% test-control ratio, and Group 2 has a 50% / 50% test-control ratio.

I want to resample such that the dataset is the same size, but the test-control ratios are 90% / 10% for both groups... in other words, resample different subgroups at different rates, which strikes me as different from what the boot package normally does.
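
For example (same numbers as in the comment thread below): if Group 1 has 150 test / 50 control records (200 total) and Group 2 has 50 test / 50 control (100 total), the resample should draw 180 test / 20 control from Group 1 and 90 test / 10 control from Group 2, so each group -- and the full dataset -- keeps its original size.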

In my dataset, I created a group variable representing the groups, and a groupT variable representing group concatenated with test/control, e.g.:

    id     group     groupT
     1         1         1T
     2         1         1T
     3         2         2T
     4         1         1C
     5         2         2C

Here's what I am running right now, with nreps arbitrarily set to be my number of bootstrap replications:

for (j in 1:nreps){

  bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset

  for (i in unique(datafile$group)){

    tstring<-paste0(i,"T") ## e.g. 1T
    cstring<-paste0(i,"C") ## e.g. 1C

    ## Size of test group resample should be ~90% of total group size

    tsize<-round(.90*length(which(datafile$group==i)),0)

    ## Size of control group resample should be total group size minus test group size

    csize<-length(which(datafile$group==i))-tsize

    ## Continue building bootdat by rbinding the test and control resample

    ## before moving on to the next group
    ## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.

    bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==tstring),size=tsize,
      replace=TRUE),])

    bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==cstring),size=csize,
      replace=TRUE),])
  }

  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}

With a dataset size of about 1 million total records, this takes 3-4 minutes per replication. I feel certain there is a better way to do this either with sapply or possibly some of the dplyr functions, but I have come up empty in my attempts so far. Any help would be appreciated!

Publican answered 14/3, 2018 at 12:19 Comment(2)
Is unique(datafile$group) just 1:2, or at least a constant for every j? The initialize empty dataset line doesn't seem to do anything either. Even so, I'm surprised that this part of the code takes more than a few seconds per replication, let alone > 3 minutes. On my machine with some dummy data it executes in 0.02s.Outspeak
There are actually up to 32 groups, rather than just 2... wanted to keep the example simpler. I am initializing the empty dataset so I can get the same structure as the dataset (columns, etc.) and then rbind each successive group as it's sampled -- not sure whether I need to initialize it that way or not.Publican

I'd strongly encourage you to look into data.table and foreach, using keyed searches for the bootstraps. That will let you do a single bootstrap very rapidly, and you can run each bootstrap independently on a different core. Each bootstrap of the code below takes 0.5 seconds on my machine, searching through a table of 1 million rows. Something like the following should get you started:

library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)

# example data
    dat <- data.table(id=1:1e6,
                      group=sample(2, size=1e6, replace=TRUE),
                      test_control=sample(c("T","C"), size=1e6, replace=TRUE))


# define number of bootstraps
    nBootstraps <- 1000

# define sampling fractions
    fraction_test <- 0.90
    fraction_control <- 1 - fraction_test

# get the number to sample from each group (the two groups here are roughly
# equal in size, so one pair of counts serves both; see the EDIT below for
# fully per-group sizes)
    group_size <- dat[group == 1, .N]
    N.test <- round(fraction_test * group_size)
    N.control <- group_size - N.test

# key data by id
    setkey(dat, id)

# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id


results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {

    # sample each group with the defined sizes, with replacement
    g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
    g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
    g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
    g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
    dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
    dat.all[, bootstrap := n]

    # do summary stats here with dat.all, return the summary stats data.table object
        return(dat.summarized)

}
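
A note on the dat[.(...)] calls above: because the table is keyed on id, passing a vector of ids inside .() performs a keyed (binary search) join rather than a full vector scan, and duplicated ids -- which sampling with replacement produces -- simply return duplicated rows. A toy illustration of the idiom:

library(data.table)

toy <- data.table(id=1:10, value=letters[1:10])
setkey(toy, id)    # sort and index by id

# keyed subset: binary search on id; repeated ids return repeated rows,
# which is exactly what a with-replacement resample needs
toy[.(c(3L, 3L, 7L))]

One practical note: doMC is not available on Windows, so if it won't install, registerDoParallel(cores=4) from the doParallel package is a drop-in replacement for registerDoMC(cores=4) here.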

EDIT: the example below builds a lookup table for an arbitrary number of unique groups. The IDs corresponding to each combination of group + (test OR control) can be referenced within a foreach loop for simplicity. With lower numbers for N.test and N.control (900 and 100), it spits out the results of 1000 bootstraps in about a minute on a single core (timing shown below).

library(data.table)
library(foreach)

# example data
    dat <- data.table(id=1:1e6,
                      group=sample(24, size=1e6, replace=TRUE),
                      test_control=sample(c("T","C"), size=1e6, replace=TRUE))

# save vector of all group values & change group to character vector for hashed environment lookup
    all_groups <- as.character(sort(unique(dat$group)))
    dat[, group := as.character(group)]


# define number of bootstraps
    nBootstraps <- 1000

# get number that you want to sample from each group
    N.test <- 900
    N.control <- 100

# key data by id
    setkey(dat, id)

# Set up lookup table for every combination of group + test/control
    control.ids <- new.env()
    test.ids <- new.env()

    for(i in all_groups) {
        control.ids[[i]] <- dat[group==i & test_control=="C"]$id
        test.ids[[i]] <- dat[group==i & test_control=="T"]$id
    }


results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    foreach(group.i = all_groups, .combine="rbind") %do% {

        # get IDs that correspond to this group, for both test and control
        control_id_vector <- control.ids[[group.i]]
        test_id_vector <- test.ids[[group.i]]

        # search and bind
        controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
        tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
        dat.group <- rbindlist(list(controls, tests))
        dat.group[, bootstrap := n]
        return(dat.group[])
    }
    # summarize across all groups for this bootstrap and return summary stat data.table object

}

yielding

> results
              id group test_control bootstrap
       1: 701570     1            C         1
       2: 424018     1            C         1
       3: 909932     1            C         1
       4:  15354     1            C         1
       5: 514882     1            C         1
      ---
23999996: 898651    24            T      1000
23999997: 482374    24            T      1000
23999998: 845577    24            T      1000
23999999: 862359    24            T      1000
24000000: 602078    24            T      1000

This doesn't involve any of the summary stat calculation time, but here 1000 bootstraps were pulled out on 1 core serially in

   user  system elapsed
 62.574   1.267  63.844

If you need to manually code N to be different for each group, you can do the same thing as with the id lookup:

# create environments
control.Ns <- new.env()
test.Ns <- new.env()

# assign size values (test gets the larger share, in line with the 90/10 goal)
test.Ns[["1"]]     <- 900
control.Ns[["1"]]  <- 100
test.Ns[["2"]]     <- 400
control.Ns[["2"]]  <- 50
# ...                  ...
test.Ns[["24"]]    <- 200
control.Ns[["24"]] <- 5
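
If the sizes follow a rule rather than needing to be hand-coded -- here, 90% test within each group, as in the question -- the same environments can be filled in a loop instead. A sketch using the id vectors already built above:

for(i in all_groups) {
    group.size <- length(test.ids[[i]]) + length(control.ids[[i]])
    test.Ns[[i]] <- round(0.9 * group.size)
    control.Ns[[i]] <- group.size - test.Ns[[i]]
}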

then change the big bootstrap loop to look up these values based on the loop's current group:

results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    foreach(group.i = all_groups, .combine="rbind") %do% {

        # get IDs that correspond to this group, for both test and control
        control_id_vector <- control.ids[[group.i]]
        test_id_vector <- test.ids[[group.i]]

        # get size values
        N.control <- control.Ns[[group.i]]
        N.test <- test.Ns[[group.i]]

        # search and bind
        controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
        tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
        dat.group <- rbindlist(list(controls, tests))
        dat.group[, bootstrap := n]
        return(dat.group[])
    }
    # summarize across all groups for this bootstrap and return summary stat data.table object

}
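
To make the summary step concrete: capture the inner loop's rows in an object, reduce it to statistics, and return only those, so that results collects one small table per bootstrap. This mirrors what the asker describes in the comments below; mean(id) is just a stand-in for a real statistic:

results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    resultDat <- foreach(group.i = all_groups, .combine="rbind") %do% {
        controls <- dat[.(sample(control.ids[[group.i]], size=control.Ns[[group.i]], replace=TRUE))]
        tests <- dat[.(sample(test.ids[[group.i]], size=test.Ns[[group.i]], replace=TRUE))]
        rbindlist(list(controls, tests))
    }

    # reduce the full resample to summary statistics for this bootstrap;
    # only this small table gets rbind-ed into results
    resultDat[, .(bootstrap = n, mean_id = mean(id), n_rows = .N), by = test_control]
}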
Rostrum answered 14/3, 2018 at 12:46 Comment(8)
Thank you for the suggestion and code. I wasn't able to get doMC installed properly, so I ran the code with %do% rather than %dopar%. I also adapted it to my actual dataset, which contains 24 groups rather than 2, so I have 24 different values for N.test and N.control, as well as 24 vectors of test_ids for both test and control, g1T through g24T, and so on... I tried to run it for just 1 bootstrap, but I canceled it after 5-10 minutes of spinning, so I'm not sure what went wrong. If my approach here brings to mind any suggestions, let me know. Your example code worked for me.Publican
Are the numbers of test/control not equal for how you're trying to sample (90%/10%)? Maybe I'm misunderstanding this part.Rostrum
I am trying to sample 90%/10% from all groups, but not all groups have the same size. For example, Group 1 was maybe 150 test/50 control (75/25 with a total size of 200), while Group 2 was 50 test/50 control (50/50 with a total size of 100). In that situation, I'd want to sample 180 Group 1 tests, 20 Group 1 controls, 90 Group 2 tests, and 10 Group 2 controls (90/10 within both groups). But of course, in the original dataset there are actually 24 such groups. I was able to adapt the code to accommodate this -- I think -- but it still spun for a long time until I killed the process.Publican
See the edit above -- you may be able to code the N values within a loop based on multiplying 0.9 by the size of each group, but if you need to hard-code each group's counts, that option is included up there as well.Rostrum
Thank you so much! This seemed to run very quickly, but I have to tidy up my code to pull summary statistics since I was getting some NaNs that I didn't expect. I should have time to do this tomorrow and see if the results make sense.Publican
The code runs quickly, but I'm getting pretty different results for my summary statistics when I go this route than when I use my older, slower code or the dplyr code that DS_UNI suggested. I don't fully understand what to expect from this code. When it finishes running, I see dat.group in my environment. Is dat.group supposed to be the same number of records as my full dat dataset? It's only the same size as the last group.i that it pulled. I was assuming that foreach(group.i = all_groups, .combine="rbind") would bind all rows from ALL groups, but it doesn't appear to do that.Publican
dat.group is updated every time the foreach loop is run -- each iteration of dat.group is rbind-ed into the object "results".Rostrum
I figured out my mistake. I have now modified the code to store that second foreach loop as a dataset: resultDat<-foreach(group.i = all_groups, .combine="rbind") %do% ... and then outside of that foreach loop, I compute summary statistics based on resultDat and return those statistics, which get pulled together into results. This runs at about 5-6 seconds per bootstrap for my large, complex dataset on a single core which is great -- a lot better than my original 200+ seconds! Thank you so much!Publican

Just like caw5cv, I recommend taking a look at data.table; it is usually very efficient at solving such problems. However, if you want to work with dplyr, you can try something like this:

library(dplyr)

summary_of_boot_data <- lapply(1:nreps, 
                           function(y){
                             # get bootdata
                             bootdata <- lapply(unique(datafile$group), 
                                                function(x){
                                                  tstring<-paste0(x,"T")
                                                  cstring<-paste0(x,"C")
                                                  tsize<-round(.90*length(which(datafile$group==x)),0)
                                                  csize<-length(which(datafile$group==x))-tsize
                                                  df <-rbind(datafile[sample(which(datafile$groupT==tstring),
                                                                             size=tsize, 
                                                                             replace=TRUE),],
                                                             datafile[sample(which(datafile$groupT==cstring),
                                                                             size=csize,
                                                                             replace=TRUE),])
                                                  return(df)
                                                }) %>% do.call(rbind, .)
                             # return your summary thing for bootdata e.g. summary(bootdata)
                             summary(bootdata)

                           })
summary_of_boot_data

I tried not to change your code much; I just replaced the for loops with lapply.

hope this helps

EDIT: Based on the comment from Hugh, you might want to try using data.table::rbindlist() instead of do.call(rbind, .).
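
For instance, a minimal illustration of the swap (the two calls are interchangeable here, but rbindlist is much faster when binding many pieces):

library(data.table)

pieces <- list(head(mtcars, 2), tail(mtcars, 2))
do.call(rbind, pieces)   # base R
rbindlist(pieces)        # data.table equivalent; returns a data.table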

Recondite answered 14/3, 2018 at 13:29 Comment(3)
Thanks - I adapted this pretty quickly to my dataset (which actually has 24 groups), and it reduces runtime from ~200 seconds per bootstrap to ~18 seconds. That's a big improvement. I ran into issues with the data.table approach and am not sure how to fix it since it just spun for a long time rather than producing an error.Publican
rbindlist is much faster than do.call(rbind, .)Outspeak
Yes, I agree, but I was trying not to use anything from 'data.table'; however, I will add this in an edit.Recondite
