I'm trying to use data.table/tidyverse to efficiently sample at two levels:
Level 1 is Hospital ID (hospital_id)
Level 2 is Doctor ID (doctor_id)
I need to first sample with replacement $N$ hospitals out of $N$ total. Then I need to sample with replacement $M_i$ doctors that work for hospital $i$ out of $M_i$ total.
Right now, I do it as follows: I sample a dataframe of unique hospital ids with replacement. Then I join the doctors to the hospitals they work for. Then I sample with replacement by hospital group.
But that creates a slow join. Is there a more efficient way to do this? This is my data.table implementation, but I'm happy to do this whichever way.
# We have a data.frame with one row for every hospital
unique_hospitals_df <- unique(hospital_df[, c("hospital_id")])
# We sample hospitals with replacement at level 1
# Note: nrow(), not length(); length() of a data.table is its number of columns
r_sampled_hospital_ids <- unique_hospitals_df[sample(nrow(unique_hospitals_df),
    floor(nrow(unique_hospitals_df) * sample_frac), replace = TRUE), ]
# Now that we have the resampled ID's, we join to the doctors data.frame at level 2
# r_sampled_hospital_ids only has hospital_id, so that is the only join key
r_df_full <- r_sampled_hospital_ids[DT, on = "hospital_id", nomatch = NULL, allow.cartesian = TRUE]
# Now we resample the doctors within each hospital with replacement (level 2)
r_DT_resampled <- r_df_full[, .SD[sample(.N, .N, replace = TRUE)], keyby = hospital_id]
Update: Mnist asked me to explain further.
dt <- data.table(HOSP = rep(LETTERS[1:5], 1:5), DOC = letters[15:1], value = 1:15)
This gives us this data:
    HOSP DOC value
 1:    A   o     1
 2:    B   n     2
 3:    B   m     3
 4:    C   l     4
 5:    C   k     5
 6:    C   j     6
 7:    D   i     7
 8:    D   h     8
 9:    D   g     9
10:    D   f    10
11:    E   e    11
12:    E   d    12
13:    E   c    13
14:    E   b    14
15:    E   a    15
So we have two steps.
- We sample with replacement (SWR) from HOSP N times (where N is the number of items in the dataset). So we would get back a mix of A, B, C, D, E of length 15.
- We SWR from each group of DOC within each HOSP M times (where M is the number of doctors in the hospital). So for B we SWR n and m 2 times, for C we SWR l, k, j 3 times, etc.
- All other columns in the sampled rows should be carried along, so they appear in the final result.
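To make the two steps concrete, here is a rough sketch on the toy data that samples row indices instead of joining. It is only an illustration under assumptions, not a tuned solution: the names rows_by_hosp, hosp_draw, idx and resampled are made up for this example, set.seed() is there only for reproducibility, and the number of hospital draws is assumed to equal the number of distinct hospitals (change the second argument of sample() if the draw should have a different length).

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the illustration is reproducible
dt <- data.table(HOSP = rep(LETTERS[1:5], 1:5), DOC = letters[15:1], value = 1:15)

# Row indices of dt grouped by hospital: a named list, one integer vector per HOSP
rows_by_hosp <- split(seq_len(nrow(dt)), dt$HOSP)

# Step 1: sample hospitals with replacement
# (assumption: as many draws as there are distinct hospitals)
hosp_draw <- sample(names(rows_by_hosp), length(rows_by_hosp), replace = TRUE)

# Step 2: for every drawn hospital, resample its M rows M times with replacement.
# sample.int() on the length avoids the sample(x) pitfall when a hospital has one row.
idx <- unlist(
  lapply(rows_by_hosp[hosp_draw], function(r) {
    r[sample.int(length(r), length(r), replace = TRUE)]
  }),
  use.names = FALSE
)

# Subsetting by row index carries all other columns along
resampled <- dt[idx]
```

A hospital that is drawn twice in step 1 contributes two independently resampled blocks of its doctors, so no join or allow.cartesian is involved; whether this is actually faster on the real data would need to be benchmarked.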
Given dt <- data.table(HOSP = rep(LETTERS[1:5], 1:5), DOC = letters[15:1], value = 1:15), what would a specific sampling at each step mean for the final result? For example, given we sample HOSP C twice, etc. – Premature