I have a dataset where I observe a variable for some individuals and not for others. For those individuals where I observe the variable, I observe it exactly once. However, the number of observations per individual as well as the position of the observed value varies.
I would like to fill all NA values for a given individual with the non-NA value in case there is a non-NA value. Otherwise, NAs are supposed to remain NA.
Here is an example dataset:
#data.frame of 100 individuals with 10 observations each
data <- data.frame(group = rep(1:100,each=10),value = NA)
#first 50 individuals get a value at the fifth observation, others don't have value
data$value[seq(5,500,10)] <- rnorm(50)
So far so good, not a big issue. Taken from another thread, we could do something like this using dplyr
and tidyr
:
data <- data %>%
group_by(group) %>% #by group
fill(value) %>% #default direction down
fill(value, .direction = "up") #also fill NAs upwards
This solves the problem perfectly fine. However, I have to do this for around 80mio. observations, which takes hours. Is there a faster method available? I think data.table
might be a good candidate.
It would also be great if it would be possible to adapt the approach to fill only NAs that appear before the value.
Thanks!
bench
package ormicrobenchmark
. Also sometimes you include defining the example data and sometimes not and OP mentions he has very big data.. I guess that should be taken into account when bencmarking. – Hodgkinson