Replacing NA's within a set of similar values
Asked Answered
H

5

6

I have a column in a dataset looking like this:

cluster_id
1
1
1
1
NA
1
NA
NA
2
NA
2
NA
3
NA
NA
3


cluster_id <- c("1","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3")

The order is already pre-defined before using a time column. What I want is to substitute the NA's that are within each cluster ID, i.e. if there's a row with 2, then an NA, and then a 2 again, I want that NA to become 2. The NA's between numbers stay as NA's. Example:

cluster_id   cluster_id_new
1            1
1            1
1            1
1            1
NA           1
1            1
NA           NA
NA           NA
2            2
NA           2
2            2
NA           NA
3            3
NA           3
NA           3
3            3 

I found the zoo::na.locf function in this post, which seems to be close to what I want, but I also need to take in consideration the value after the NA. Any thoughts?

Hottempered answered 27/9 at 10:0 Comment(7)
Great question! Two quick questions: are these values actually characters - i.e., the value actually "NA" (a character value) or NA (missing value) or "1" (character) or 1 (numeric) - and are your data in a single vector like the example, or a data frame?Carioca
See if this question is similar #56134578Jacquelyn
How would want to handle leading/trailing NAs?Spotter
@Carioca thanks! the numbers are int, and the NA is a missing value, and the data is actually in a dataframe! But the solutions provided for a single vector seem to work well :) I just wanted to simplify the question/postHottempered
@Spotter not sure I understood what do you mean? e.g. an NA at a start or end of a sequence?Hottempered
Yes, for example, if cluster_id had an NA as the very first or very last value. (See my answer--I added an NA at the beginning and end of cluster_id so that the behavior of the solutions was clear.)Spotter
Ah I see now @jblood94, that was a great remark, thanks!!Hottempered
S
7

Update: thanks to @Darren Tsai (who provided a better version):

library(dplyr)
library(tidyr)

tibble(cluster_id) %>% 
  mutate(down = cluster_id, up = cluster_id) %>% 
  fill(down, .direction = "down") %>% 
  fill(up, .direction = "up") %>% 
  mutate(cluster_id_new = if_else(down == up, down, NA)) %>% 
  select(-c(down, up))

Original answer: Here is tidyverse way using mainly fill:

library(dplyr)
library(tidyr)

tibble(cluster_id) %>%
  mutate(helper = row_number(),
         filldown = fill(., cluster_id, .direction = "down")$cluster_id,
         fillup = fill(., cluster_id, .direction = "up")$cluster_id,
         cluster_id_new = ifelse(filldown == fillup, filldown, NA_real_)) %>%
  select(cluster_id, cluster_id_new)
cluster_id cluster_id_new
        <dbl>          <dbl>
 1          1              1
 2          1              1
 3          1              1
 4          1              1
 5         NA              1
 6          1              1
 7         NA             NA
 8         NA             NA
 9          2              2
10         NA              2
11          2              2
12         NA             NA
13          3              3
14         NA              3
15         NA              3
16          3              3
Staffordshire answered 27/9 at 11:25 Comment(3)
I have the same idea! Maybe your code can be more readable like this: df %>% mutate(down = cluster_id, up = cluster_id) %>% fill(down, .direction = "down") %>% fill(up, .direction = "up") %>% mutate(cluster_id_new = if_else(down == up, down, NA)) %>% select(-c(down, up))Chivers
This works very well, thanks! Hard to choose one as most of them works, but is one might be the "cleanest", so I'll choose this approachHottempered
also @DarrenTsai that works too :)Hottempered
S
7

The collapse package has both a na_locf and a na_focb. You could do a pass with each then check for discrepancies.

If your NA values are stored as characters and not NA_character_, you first need to convert to NA_character_.

The solution also depends on how you want to handle leading/trailing "NA"s. The first option keeps them.

cluster_id <- c("NA","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3","NA")

library(collapse)

i <- which(cluster_id == "NA")
cluster_id[i] <- NA
cluster_id_filled <- na_locf(cluster_id)
cluster_id_filled[-which(cluster_id_filled == na_focb(cluster_id))] <- "NA"
cluster_id[i] <- "NA" # restore the original vector
cluster_id_filled
#>  [1] "NA" "1"  "1"  "1"  "1"  "1"  "NA" "NA" "2"  "2"  "2"  "NA" "3"  "3"  "3"  "3"  "NA"

This second solution uses a single for loop and fills leading/trailing "NA"s.

cluster_id_filled <- cluster_id
j <- 0
for (i in seq_along(cluster_id)) {
  if (cluster_id[i] == "NA") {
    j <- j + 1
  } else if (j != 0) {
    # use `!isFALSE` to handle leading "NA"s
    if (!isFALSE(cluster_id[i] == cluster_id[i - j - 1])) {
      cluster_id_filled[(i - j):(i - 1)] <- cluster_id[i]
    }
    j <- 0
  }
}
# handle trailing "NA"s
if (j != 0) cluster_id_filled[(i - j + 1):i] <- cluster_id_filled[i - j]
cluster_id_filled
#>  [1] "1"  "1"  "1"  "1"  "1"  "1"  "NA" "NA" "2"  "2"  "2"  "NA" "3"  "3"  "3"  "3"  "3"

If your NA values are actually NA, it's a little easier.

cluster_id <- c(NA, 1, 1, 1, 1, NA, 1, NA, NA, 2, NA, 2, NA, 3, NA, NA, 3, NA)
cluster_id_filled <- na_locf(cluster_id)
cluster_id_filled[-which(cluster_id_filled == na_focb(cluster_id))] <- NA
cluster_id_filled
#>  [1] NA  1  1  1  1  1  1 NA NA  2  2  2 NA  3  3  3  3 NA
Spotter answered 27/9 at 11:28 Comment(2)
Brilliant answer! I like your for-loop approach too - very cleverLuciana
Thanks!! This one worked too, thanks for sharing!Hottempered
T
7

Here is a base R option with rle + cumsum

cluster_id_new <- with(
    rle(cluster_id),
    {
        idx <- values != "NA"
        grp <- cumsum(c(TRUE, head(values[idx], -1) != tail(values[idx], -1)))
        rep(ave(values,
            cummax(replace(idx, which(idx), grp)),
            FUN = \(x) {
                u <- unique(x[x != "NA"])
                if (tail(x, 1) == "NA") {
                    c(rep(u, length(x) - 1), "NA")
                } else {
                    u
                }
            }
        ), lengths)
    }
)

such that

> data.frame(cluster_id, cluster_id_new)
   cluster_id cluster_id_new
1           1              1
2           1              1
3           1              1
4           1              1
5          NA              1
6           1              1
7          NA             NA
8          NA             NA
9           2              2
10         NA              2
11          2              2
12         NA             NA
13          3              3
14         NA              3
15         NA              3
16          3              3
Transgression answered 27/9 at 13:17 Comment(0)
H
6

1) na.approx Presumably the cluster id's are integer and it seems they increment by 1 so performing linear interpolation and testing the result for integer is sufficient to determine whether the result should be NA or not.

library(zoo)

x <- cluster_id
a <- na.approx(as.numeric(ifelse(x == "NA", NA, x)), na.rm = FALSE)
data.frame(cluster_id,
           cluster_id_new = ifelse(a == as.integer(a), a, NA)
)

giving:

   cluster_id cluster_id_new
1           1              1
2           1              1
3           1              1
4           1              1
5        <NA>              1
6           1              1
7        <NA>             NA
8        <NA>             NA
9           2              2
10       <NA>              2
11          2              2
12       <NA>             NA
13          3              3
14       <NA>              3
15       <NA>              3
16          3              3

2) na.locf0 If the forward and backward na.locf0 are the same for an NA then it should be replaced. This is more general since it works directly with the character values and does not assume the id's increase by 1.

library(zoo)

a <- na.locf0(cluster_id)
data.frame(cluster_id,
  cluster_id_new = ifelse(a == na.locf0(cluster_id, fromLast = TRUE), a, NA)
)

Note

cluster_id <- c("1", "1", "1", "1", NA, "1", NA, NA, "2", NA, "2", NA, "3", NA, NA, "3")
Harkey answered 27/9 at 12:46 Comment(0)
L
4

I can't think of a 'clever' solution using existing functions that satisfies your criteria; would a for-loop approach work?

cluster_id <- c("1","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3")
for (i in seq_along(cluster_id[-1])) {
  if(cluster_id[i + 1] == "NA") {
    for(j in (i + 1):length(cluster_id)) {
      if(cluster_id[i] == cluster_id[j]) {
        cluster_id[i + 1] <- cluster_id[j]
      }
    }
  }
}
cluster_id
#>  [1] "1"  "1"  "1"  "1"  "1"  "1"  "NA" "NA" "2"  "2"  "2"  "NA" "3"  "3"  "3" 
#> [16] "3"

Created on 2024-09-27 with reprex v2.1.0

Luciana answered 27/9 at 10:50 Comment(1)
Thanks!! This one worked too, thanks for sharing!Hottempered

© 2022 - 2024 — McMap. All rights reserved.