Efficiently fill NAs by group
I have a dataset where I observe a variable for some individuals and not for others. For those individuals where I observe the variable, I observe it exactly once. However, the number of observations per individual as well as the position of the observed value varies.

I would like to fill all NA values for a given individual with that individual's non-NA value, if one exists. Otherwise, the NAs should remain NA.

Here is an example dataset:

#data.frame of 100 individuals with 10 observations each
data <- data.frame(group = rep(1:100,each=10),value = NA)

#first 50 individuals get a value at the fifth observation, others don't have value
data$value[seq(5,500,10)] <- rnorm(50)

So far so good, not a big issue. Borrowing from another thread, we could do something like this using dplyr and tidyr:

data <- data %>% 
  group_by(group) %>% #by group
  fill(value) %>% #default direction down
  fill(value, .direction = "up") #also fill NAs upwards

This solves the problem perfectly fine. However, I have to do this for around 80 million observations, which takes hours. Is there a faster method available? I think data.table might be a good candidate.

It would also be great if it were possible to adapt the approach to fill only the NAs that appear before the value.

Thanks!

Hoecake answered 9/1, 2019 at 12:33 Comment(0)
This is the code I used: your code vs. akrun's vs. mine. zoo is not always the fastest approach, but it is the cleanest. In any case, you can test it yourself.

UPDATE: Tested with more data (100,000), Process 03 (subset and merge) wins by far.

Last UPDATE: function comparison with rbenchmark:

library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
library(rbenchmark)

#data.frame of 10,000 individuals with 10 observations each
data <- data.frame(group = rep(1:10000, each = 10), value = NA)
#first 500 individuals get a value at the fifth observation, the others stay NA
data$value[seq(5, 5000, 10)] <- rnorm(500)

#Process01
P01 <- function (data){
    data01 <- data %>% 
        group_by(group) %>% #by group
            fill(value) %>% #default direction down
            fill(value, .direction = "up") #also fill NAs upwards
    return(data01)
}

#Process02
P02 <- function (data){
    data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE), 
                                             fromLast = TRUE), group]
    return(data02)
}

#Process03 (base R subset and merge)
P03 <- function (data){
    dataU <- subset(unique(data), !is.na(value)) #one row per group that has a non-NA value
    dataM <- merge(data, dataU, by = "group", all = TRUE) #merge back onto all rows
    data03 <- data.frame(group = dataM$group, value = dataM$value.y) #same shape as data
    return(data03)
}

benchmark("P01_dplyr" = {data01 <- P01(data)},
          "P02_zoo" = {data02 <- P02(data)},
          "P03_data.table" = {data03 <- P03(data)},
          replications = 10,
          columns = c("test", "replications", "elapsed")
          )

Results with 10,000 groups, 10 replications, on an i5-7400:

       test replications elapsed
1 P01_dplyr           10  257.78
2   P02_zoo           10   10.35
3 P03_merge           10    0.09
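Side note: as a commenter suggests below, the bench or microbenchmark packages report per-expression timing distributions. A minimal sketch, assuming the P01/P02/P03 functions above are defined; each call gets a fresh copy() so that the in-place setDT() conversion inside P02 cannot leak into the other replications:

library(microbenchmark)
library(data.table)

#run each approach on a fresh copy of the input data
microbenchmark(
    P01_dplyr = P01(copy(data)),
    P02_zoo   = P02(copy(data)),
    P03_merge = P03(copy(data)),
    times = 10
)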
Riflery answered 9/1, 2019 at 13:40 Comment(2)
Might consider studying the bench package or microbenchmark. Also, sometimes you include defining the example data and sometimes not, and the OP mentions he has very big data; I guess that should be taken into account when benchmarking.Hodgkinson
I've tried increasing data and process 03 wins by far.Interplay
You can use a pretty simple approach with both data.table and dplyr which, I believe, will be quite fast and efficient:

in data.table:

library(data.table)
setDT(data)
#broadcast the group's single non-NA value (or NA if the group has none) to all rows
data[, value := value[!is.na(value)][1L], by = group]

or dplyr:

library(dplyr)
data <- data %>% 
  group_by(group) %>% 
  mutate(value = value[!is.na(value)][1L]) #same idea: first non-NA value, or NA

The point is you have a non-NA value exactly 0 or 1 times per group. Hence you don't need the last-observation-carried-forward logic. Just take the first non-NA value (if it exists).
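For the follow-up in the question (fill only the NAs that appear before the value), here is a minimal data.table sketch under the same at-most-one-value assumption; this block is my illustration, not part of the original answer, and it assumes data has already been converted with setDT() as above:

#fill only positions up to (and including) the observed value;
#NAs after the value, and all-NA groups, are left untouched
data[, value := {
    i <- which(!is.na(value))[1L]  #position of the non-NA value, NA if the group has none
    if (is.na(i)) value else c(rep(value[i], i), value[-seq_len(i)])
}, by = group]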

Fabio answered 9/1, 2019 at 12:59 Comment(0)
We could use data.table to assign in place. Here, na.locf from zoo is used to fill the NA elements with the adjacent non-NA elements:

library(data.table)
library(zoo)
#fill down first (na.rm = FALSE keeps leading NAs), then fill those leading NAs upward
setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE), fromLast = TRUE), group]
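As a side note, recent data.table versions (>= 1.12.4) ship a native nafill(), which drops the zoo dependency; a minimal sketch of the same down-then-up fill:

#locf = last observation carried forward, nocb = next observation carried backward
setDT(data)[, value := nafill(nafill(value, type = "locf"), type = "nocb"), by = group]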

Benchmarks

set.seed(24)
data1 <- data.frame(group = rep(1:1e6,each=10),value = NA)
data1$value[seq(5,1e6,10)] <- rnorm(100000)

data2 <- copy(data1)

system.time({setDT(data2)[, value := na.locf(na.locf(value, 
             na.rm = FALSE), fromLast = TRUE), group]})
#   user  system elapsed 
# 70.681   0.294  70.917 


system.time({

data1 %>% 
  group_by(group) %>% #by group
  fill(value) %>% #default direction down
  fill(value, .direction = "up")

})
# 17% ~33 m remaining 

NOTE: It was taking too long, so I had to abort the session.

NOTE2: This approach is based on the assumption that we want to replace the NA elements with the adjacent non-NA elements, and it also handles groups with more than one non-NA element.
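A tiny toy example (a hypothetical vector, not from the benchmark data) showing what the nested na.locf calls do:

library(zoo)
x <- c(NA, 1, NA, NA, 2, NA)
na.locf(na.locf(x, na.rm = FALSE), fromLast = TRUE)
#[1] 1 1 1 1 2 2  (the leading NA is filled upward, the rest carried forward)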

Jump answered 9/1, 2019 at 12:36 Comment(2)
Thanks very much! I benchmarked as well (100k observations) and found that your approach takes around 1/5 of the time of the tidyverse approach. Will be interesting to see how it scales!Hoecake
@Mr.Zen For me, the dplyr approach had only completed 27%, with a lot more time remaining.Jump