data.table fill missing values from other rows by group

Asked 20/2, 2018 at 22:55 Answered 20/2, 2018 at 23:16

# have
> aDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))
> aDT
    colA colB
 1:    1    4
 2:    1   NA
 3:    1   NA
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2   NA
 8:    2   NA
 9:    3    4
10:    3   NA
11:    3    2
12:    3   NA
# want
> bDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,1,1,1,4,3,3,3,4,2,2,2))
> bDT
    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

Would like to fill missing values according to the algorithm below: within each group ('colA'),

use the value from one row below, if it's still NA, keeps going until the last row within that group
if all NAs in rows below, look at rows above (go up 1 row at a time)
if all NAs, then NA

Since the dataset is quite large, algorithmic efficiency is part of consideration. Not sure if there's any package for this type of operation already. How to do it?

Geoponic answered 20/2, 2018 at 22:55 Comment(0)

With data.table and zoo:

library(data.table)
library(zoo)

# Last observation carried forward from last row of group
dt <- dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]

# Last observation carried forward for first row of group
dt[, colB := na.locf(colB), by = colA][]

Or in a single chain:

dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
   , colB := na.locf(colB), by = colA][]

Both return:

    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

Data:

text <- "colA colB
    1    4
    1   NA
    1   NA
    1    1
    2    4
    2    3
    2   NA
    2   NA
    3    4
    3   NA
    3    2
    3   NA"

dt <- fread(input = text, stringsAsFactors = FALSE)

Shoa answered 20/2, 2018 at 23:16 Comment(0)

Here is one way using tidyverse and zoo::na.locf:

library(tidyverse);
library(zoo);
df %>%
    group_by(colA) %>%
    arrange(colA) %>%
    mutate(colB = na.locf(colB, na.rm = F, fromLast = TRUE)) %>%
    mutate(colB = na.locf(colB, na.rm = F));
## A tibble: 12 x 2
## Groups:   colA [3]
#    colA  colB
#   <dbl> <dbl>
# 1  1.00  4.00
# 2  1.00  1.00
# 3  1.00  1.00
# 4  1.00  1.00
# 5  2.00  4.00
# 6  2.00  3.00
# 7  2.00  3.00
# 8  2.00  3.00
# 9  3.00  4.00
#10  3.00  2.00
#11  3.00  2.00
#12  3.00  2.00

Or the data.table way:

library(data.table);
dt[, .(na.locf(na.locf(colB, na.rm = F, fromLast = T), na.rm = F)), by = .(colA)];
#    colA V1
# 1:    1  4
# 2:    1  1
# 3:    1  1
# 4:    1  1
# 5:    2  4
# 6:    2  3
# 7:    2  3
# 8:    2  3
# 9:    3  4
#10:    3  2
#11:    3  2
#12:    3  2

The key in both cases is to apply na.locf twice: First to replace NAs from the bottom, then replace the remaining NAs from the top.

Sample data

# As data.frame
df <- data.frame(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA));
# As data.table
dt <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA));

Involved answered 20/2, 2018 at 23:5 Comment(0)

library(tidyverse)

aDT%>%group_by(colA)%>%fill(colB,.direction="up")%>%fill(colB)
# A tibble: 12 x 2
# Groups:   colA [3]
    colA  colB
   <dbl> <dbl>
 1     1     4
 2     1     1
 3     1     1
 4     1     1
 5     2     4
 6     2     3
 7     2     3
 8     2     3
 9     3     4
10     3     2
11     3     2
12     3     2

Crusado answered 20/2, 2018 at 23:16 Comment(0)

Sample data

Recommended topics

Hot tags