How to efficiently sample with replacement at two levels of hierarchy with data.table/dplyr?
Asked Answered
T

2

3

I'm trying to use data.table/tidyverse to efficiently sample at two levels:

Level 1 is Hospital ID ( hospital_id )
Level 2 is Doctor ID ( doctor_id )

I need to first sample with replacement $N$ hospitals out of $N$ total. Then I need to sample with replacement $M_i$ doctors that work for hospital $i$ out of $M_i$ total.

Right now, I do it as follows: I sample a dataframe of unique hospital ids with replacement. Then I join the doctors to the hospitals they work for. Then I sample with replacement by hospital group.

But that creates a slow join. Is there a way to do this that is more efficient? This is my data.table implementation but happy to do this whichever way.

# We have a data.frame with one row for every hospital
unique_hospitals_df <- unique(hospital_df[, c("hospital_id")])

# We sample hospitals with replacement at level 1
r_sampled_hospital_ids <- unique_hospitals_df[sample(nrow(unique_hospitals_df),
                      floor(length(unique_hospitals_df) * sample_frac), replace=T), ]

# Now that we have the resampled ID's, we join to the doctors data.frame at level 2
r_df_full <- r_sampled_hospital_ids[,c("hospital_id", "doctor_id")][DT, on = c("hospital_id", "doctor_id"), nomatch = NULL, allow.cartesian = T]

# Now we resample the doctors within each hospital with replacement (level 2)
r_DT_resampled <- r_df_full[, .SD[sample(.N, .N, replace=T)], keyby = hospital_id]

Update: Mnist asked to explain further.

dt <- data.table(HOSP = rep(LETTERS[1:5], 1:5),                  DOC = letters[15:1],                  value = 1:15)

This gives us this data:

   HOSP DOC value
 1:    A   o     1
 2:    B   n     2
 3:    B   m     3
 4:    C   l     4
 5:    C   k     5
 6:    C   j     6
 7:    D   i     7
 8:    D   h     8
 9:    D   g     9
10:    D   f    10
11:    E   e    11
12:    E   d    12
13:    E   c    13
14:    E   b    14
15:    E   a    15

So we have two steps.

  1. We SWR from HOSP N times (where N is the number of items in the dataset). So we would get back a mix of A,B,C,D,E of length 15.
  2. We SWR from each group of DOC within each HOSP ID M times (Where M is the number of doc in the hospital). So for B we SWR n and m 2 times. For Hosp C we SWR l,k,j 3 times etc..
  3. All other columns in the rows should be included in the SWR operations in the end.
Triglyceride answered 7/7, 2023 at 5:26 Comment(10)
Please expand on your sampling needs. Do you mean that if you have (say) 20 hospitals and 50 doctors at each, you want to choose 4 of those hospitals and then 8 doctors at each of those 4 hospitals?Vulcanize
I would want to sample with replacement all 20 and all 60.Triglyceride
Does sampling with replacement from the concatenation of hospital and physician IDs do the job? On another note, had there been only one level of sampling, the following construct is helpful fast and and may possibly give you an idea for your situation. (1) Create a new object that is split(1 : .N, hospital) to create a list of vectors of original observation numbers. (2) Get a sample with replacement of hospitals. (3) Use this to index into the split object. (4) unlist the subscripted split object to get all the observation numbers (row numbers) to fetch for your sample (integer vector).Pachysandra
@FrankHarrell Hi Frank. I need to preserve the hierarchy. So first SWR N from N hospitals Then foreach i hospital with Mi doctors, SWR Mi doctors. Thanks for that idea. I have similarly seen folks split the hospitals into a list of data.frames. Then we can SWR from the list and already have the doctors within each SWR data.frame to SWR from at the second stage. However, it seems to be quite slow? I will try your idea.Triglyceride
Could you please provide a minimal reproducible example? Given this data, dt <- data.table(HOSP = rep(LETTERS[1:5], 1:5), DOC = letters[15:1], value = 1:15), what would a specific sampling at each step mean for the final result. For example, given we sample HOSP C twice, etc.Premature
@Premature OK done. I reexplained again. Let me know if this is not enough.Triglyceride
Your explanation in 1 and 2 kinda contradict each other. From point 1 I read you can draw any number of B by chance up to and including 15. Then in point 2 you mention B being sampled always twice and C be sampled 3 times. If Point 2 is what you want there is no need to sample on Hospital first at all.Cyprinid
@MerijnvanTilborg let me clarify. We need to first sample on hospital with replacement because we may end up with duplicate hospitals after SWR. Then in each sample made we sample with replacement its doctors.Triglyceride
I provided an answer based on first sampling your Hospitals and from there the Doctors, if that is not how you meant it to be, please respond there and discuss the outcome.Cyprinid
@MerijnvanTilborg I see it. I'll respond shortly. Thank you for taking the time to reply.Triglyceride
C
1

I hope I understand your desired outcome, but here we first sample the Hospitals, with set.seed(1) we get 4xA, 4xB, etc. (see the mn). Then we sample the doctors based on those draws. This means we get 4x Doctor o as there is just one to sample from for that Hospital. For Hospital E we had sampled it 3 times, but as your requirement is sample with replacement, we happen to get Doctor b once and Doctor e twice while the other available Doctors were not drawn.

dt <- data.table(
  HOSP = rep(LETTERS[1:5], 1:5),
  DOC = letters[15:1],
  value = 1:15
)

set.seed(1)

# sample your hospitals and get the counts
mn <- table(sample(unique(dt$HOSP), nrow(dt), replace = TRUE))

# mn
# A B C D E 
# 4 4 3 1 3

# get your doctor sample based on the hospital samples in mn
samples <- data.table(stack(sapply(names(mn), FUN = \(x) dt[HOSP == x, sample(unlist(.SD), mn[x], replace = TRUE), .SDcols = "DOC"])))
setnames(samples, new = c("DOC", "HOSP"))

# join to get the values
samples[dt, on = c("HOSP", "DOC"), nomatch = 0]

#     DOC HOSP value
# 1:    o    A     1
# 2:    o    A     1
# 3:    o    A     1
# 4:    o    A     1
# 5:    n    B     2
# 6:    m    B     3
# 7:    m    B     3
# 8:    m    B     3
# 9:    l    C     4
# 10:   k    C     5
# 11:   j    C     6
# 12:   h    D     8
# 13:   e    E    11
# 14:   e    E    11
# 15:   b    E    14
Cyprinid answered 11/7, 2023 at 10:56 Comment(0)
P
1

I tried to make your example reproducible. This would be my way to improve it

Solution

library(data.table)
set.seed(123)
n <- 50000
dt <- data.table(HOSP  = sort(sample(LETTERS, size = n, replace = TRUE)),
                 DOC   = 1,
                 VALUE = seq_len(n))
dt[, DOC := seq_len(.N), by = HOSP]

# sample Hospitals with replacement on first stage
set.seed(123)
hosps <- sample(unique(dt$HOSP), size = length(unique(dt$HOSP)), replace = TRUE)

# create weights for sampling
dt[, WEIGHT := colSums(sapply(HOSP, `==`, hosps))]

# Sample doctors with replacement on second stage
dt[WEIGHT > 0, 
   .SD[sample(seq_len(.N), size = .N * WEIGHT[1], replace = TRUE)], 
   by = .(HOSP)]
#>        HOSP  DOC VALUE WEIGHT
#>     1:    C  932  4809      2
#>     2:    C 1614  5491      2
#>     3:    C  593  4470      2
#>     4:    C  555  4432      2
#>     5:    C  373  4250      2
#>    ---                       
#> 50308:    Z  205 48246      2
#> 50309:    Z 1804 49845      2
#> 50310:    Z 1805 49846      2
#> 50311:    Z 1086 49127      2
#> 50312:    Z 1637 49678      2

Performance

microbenchmark::microbenchmark(
  old = {
    set.seed(123)
    # We sample hospitals with replacement at level 1
    r_sampled_hospital_ids <- dt[sample(nrow(dt),
                                        size = nrow(dt),
                                        replace=T), ]
    
    # Now that we have the resampled ID's, we join to the doctors data.frame at level 2
    r_df_full <- r_sampled_hospital_ids[, .(HOSP, DOC)][dt, on = .(HOSP, DOC), nomatch = NULL, allow.cartesian = T]
    
    # Now we resample the doctors within each hospital with replacement (level 2)
    r_DT_resampled <- r_df_full[, .SD[sample(.N, .N, replace = T)], keyby = HOSP]
  },
  new = {
    set.seed(123)
    hosps <- sample(unique(dt$HOSP),
                    size = length(unique(dt$HOSP)), replace = TRUE)
    
    # create weights for sampling
    dt[, WEIGHT := colSums(sapply(HOSP, `==`, hosps))]
    
    # Sample doctors with replacement on second stage
    dt[WEIGHT > 0, 
       .SD[sample(seq_len(.N), size = .N * WEIGHT[1], replace = TRUE)], 
       by = .(HOSP)]
    
  },
  times = 1
)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>   old 227.3417 227.3417 227.3417 227.3417 227.3417 227.3417     1
#>   new 108.6217 108.6217 108.6217 108.6217 108.6217 108.6217     1
Premature answered 11/7, 2023 at 15:18 Comment(2)
One issue here -- which I only realized just as you wrote this -- is this breaks the structure, so to speak. Though I think it can be fixed somewhat easily. In the end, I need to still know which sample hospital the doctors belong to. In other words, if we SWR hospitals and choose a hospital twice, we need to know which doctors SWR from the hospitals are members of each of the two hospitals. I'm wondering how we could build that duplicate indexing in here.Triglyceride
The way I've handled it is, after SWR hospitals, I give each hospital a unique id in a new column, then group by that new column. In this case, that won't work because we're just using weighted sampling. What do you think?Triglyceride

© 2022 - 2024 — McMap. All rights reserved.