Replacing NA's within a set of similar values

Asked 27/9, 2024 at 10:0 Answered 27/9, 2024 at 13:17

I have a column in a dataset looking like this:

cluster_id
1
1
1
1
NA
1
NA
NA
2
NA
2
NA
3
NA
NA
3


cluster_id <- c("1","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3")

The order is already pre-defined before using a time column. What I want is to substitute the NA's that are within each cluster ID, i.e. if there's a row with 2, then an NA, and then a 2 again, I want that NA to become 2. The NA's between numbers stay as NA's. Example:

cluster_id   cluster_id_new
1            1
1            1
1            1
1            1
NA           1
1            1
NA           NA
NA           NA
2            2
NA           2
2            2
NA           NA
3            3
NA           3
NA           3
3            3

I found the zoo::na.locf function in this post, which seems to be close to what I want, but I also need to take in consideration the value after the NA. Any thoughts?

Hottempered answered 27/9, 2024 at 10:0 Comment(7)

Great question! Two quick questions: are these values actually characters - i.e., the value actually "NA" (a character value) or NA (missing value) or "1" (character) or 1 (numeric) - and are your data in a single vector like the example, or a data frame? – Carioca 27/9, 2024 at 10:57

See if this question is similar #56134578 – Jacquelyn 27/9, 2024 at 11:46

How would want to handle leading/trailing NAs? – Spotter 27/9, 2024 at 11:55

@Carioca thanks! the numbers are int, and the NA is a missing value, and the data is actually in a dataframe! But the solutions provided for a single vector seem to work well :) I just wanted to simplify the question/post – Hottempered 27/9, 2024 at 13:26

@Spotter not sure I understood what do you mean? e.g. an NA at a start or end of a sequence? – Hottempered 27/9, 2024 at 13:28

Yes, for example, if cluster_id had an NA as the very first or very last value. (See my answer--I added an NA at the beginning and end of cluster_id so that the behavior of the solutions was clear.) – Spotter 27/9, 2024 at 13:40

Ah I see now @jblood94, that was a great remark, thanks!! – Hottempered 27/9, 2024 at 13:49

Update: thanks to @Darren Tsai (who provided a better version):

library(dplyr)
library(tidyr)

tibble(cluster_id) %>% 
  mutate(down = cluster_id, up = cluster_id) %>% 
  fill(down, .direction = "down") %>% 
  fill(up, .direction = "up") %>% 
  mutate(cluster_id_new = if_else(down == up, down, NA)) %>% 
  select(-c(down, up))

Original answer: Here is tidyverse way using mainly fill:

library(dplyr)
library(tidyr)

tibble(cluster_id) %>%
  mutate(helper = row_number(),
         filldown = fill(., cluster_id, .direction = "down")$cluster_id,
         fillup = fill(., cluster_id, .direction = "up")$cluster_id,
         cluster_id_new = ifelse(filldown == fillup, filldown, NA_real_)) %>%
  select(cluster_id, cluster_id_new)

cluster_id cluster_id_new
        <dbl>          <dbl>
 1          1              1
 2          1              1
 3          1              1
 4          1              1
 5         NA              1
 6          1              1
 7         NA             NA
 8         NA             NA
 9          2              2
10         NA              2
11          2              2
12         NA             NA
13          3              3
14         NA              3
15         NA              3
16          3              3

Staffordshire answered 27/9, 2024 at 11:25 Comment(3)

I have the same idea! Maybe your code can be more readable like this:

df %>% mutate(down = cluster_id, up = cluster_id) %>% fill(down, .direction = "down") %>% fill(up, .direction = "up") %>% mutate(cluster_id_new = if_else(down == up, down, NA)) %>% select(-c(down, up))

– Chivers 27/9, 2024 at 11:50

This works very well, thanks! Hard to choose one as most of them works, but is one might be the "cleanest", so I'll choose this approach – Hottempered 27/9, 2024 at 13:35

also @DarrenTsai that works too :) – Hottempered 27/9, 2024 at 13:35

The collapse package has both a na_locf and a na_focb. You could do a pass with each then check for discrepancies.

If your NA values are stored as characters and not NA_character_, you first need to convert to NA_character_.

The solution also depends on how you want to handle leading/trailing "NA"s. The first option keeps them.

cluster_id <- c("NA","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3","NA")

library(collapse)

i <- which(cluster_id == "NA")
cluster_id[i] <- NA
cluster_id_filled <- na_locf(cluster_id)
cluster_id_filled[-which(cluster_id_filled == na_focb(cluster_id))] <- "NA"
cluster_id[i] <- "NA" # restore the original vector
cluster_id_filled
#>  [1] "NA" "1"  "1"  "1"  "1"  "1"  "NA" "NA" "2"  "2"  "2"  "NA" "3"  "3"  "3"  "3"  "NA"

This second solution uses a single for loop and fills leading/trailing "NA"s.

cluster_id_filled <- cluster_id
j <- 0
for (i in seq_along(cluster_id)) {
  if (cluster_id[i] == "NA") {
    j <- j + 1
  } else if (j != 0) {
    # use `!isFALSE` to handle leading "NA"s
    if (!isFALSE(cluster_id[i] == cluster_id[i - j - 1])) {
      cluster_id_filled[(i - j):(i - 1)] <- cluster_id[i]
    }
    j <- 0
  }
}
# handle trailing "NA"s
if (j != 0) cluster_id_filled[(i - j + 1):i] <- cluster_id_filled[i - j]
cluster_id_filled
#>  [1] "1"  "1"  "1"  "1"  "1"  "1"  "NA" "NA" "2"  "2"  "2"  "NA" "3"  "3"  "3"  "3"  "3"

If your NA values are actually NA, it's a little easier.

cluster_id <- c(NA, 1, 1, 1, 1, NA, 1, NA, NA, 2, NA, 2, NA, 3, NA, NA, 3, NA)
cluster_id_filled <- na_locf(cluster_id)
cluster_id_filled[-which(cluster_id_filled == na_focb(cluster_id))] <- NA
cluster_id_filled
#>  [1] NA  1  1  1  1  1  1 NA NA  2  2  2 NA  3  3  3  3 NA

Spotter answered 27/9, 2024 at 11:28 Comment(2)

Brilliant answer! I like your for-loop approach too - very clever – Luciana 27/9, 2024 at 11:44

Thanks!! This one worked too, thanks for sharing! – Hottempered 27/9, 2024 at 13:36

Here is a base R option with rle + cumsum

cluster_id_new <- with(
    rle(cluster_id),
    {
        idx <- values != "NA"
        grp <- cumsum(c(TRUE, head(values[idx], -1) != tail(values[idx], -1)))
        rep(ave(values,
            cummax(replace(idx, which(idx), grp)),
            FUN = \(x) {
                u <- unique(x[x != "NA"])
                if (tail(x, 1) == "NA") {
                    c(rep(u, length(x) - 1), "NA")
                } else {
                    u
                }
            }
        ), lengths)
    }
)

such that

> data.frame(cluster_id, cluster_id_new)
   cluster_id cluster_id_new
1           1              1
2           1              1
3           1              1
4           1              1
5          NA              1
6           1              1
7          NA             NA
8          NA             NA
9           2              2
10         NA              2
11          2              2
12         NA             NA
13          3              3
14         NA              3
15         NA              3
16          3              3

Transgression answered 27/9, 2024 at 13:17 Comment(0)

1) na.approx Presumably the cluster id's are integer and it seems they increment by 1 so performing linear interpolation and testing the result for integer is sufficient to determine whether the result should be NA or not.

library(zoo)

x <- cluster_id
a <- na.approx(as.numeric(ifelse(x == "NA", NA, x)), na.rm = FALSE)
data.frame(cluster_id,
           cluster_id_new = ifelse(a == as.integer(a), a, NA)
)

giving:

   cluster_id cluster_id_new
1           1              1
2           1              1
3           1              1
4           1              1
5        <NA>              1
6           1              1
7        <NA>             NA
8        <NA>             NA
9           2              2
10       <NA>              2
11          2              2
12       <NA>             NA
13          3              3
14       <NA>              3
15       <NA>              3
16          3              3

2) na.locf0 If the forward and backward na.locf0 are the same for an NA then it should be replaced. This is more general since it works directly with the character values and does not assume the id's increase by 1.

library(zoo)

a <- na.locf0(cluster_id)
data.frame(cluster_id,
  cluster_id_new = ifelse(a == na.locf0(cluster_id, fromLast = TRUE), a, NA)
)

Note

cluster_id <- c("1", "1", "1", "1", NA, "1", NA, NA, "2", NA, "2", NA, "3", NA, NA, "3")

Harkey answered 27/9, 2024 at 12:46 Comment(0)

I can't think of a 'clever' solution using existing functions that satisfies your criteria; would a for-loop approach work?

cluster_id <- c("1","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3")
for (i in seq_along(cluster_id[-1])) {
  if(cluster_id[i + 1] == "NA") {
    for(j in (i + 1):length(cluster_id)) {
      if(cluster_id[i] == cluster_id[j]) {
        cluster_id[i + 1] <- cluster_id[j]
      }
    }
  }
}
cluster_id
#>  [1] "1"  "1"  "1"  "1"  "1"  "1"  "NA" "NA" "2"  "2"  "2"  "NA" "3"  "3"  "3" 
#> [16] "3"

^{Created on 2024-09-27 with reprex v2.1.0}

Luciana answered 27/9, 2024 at 10:50 Comment(1)

Thanks!! This one worked too, thanks for sharing! – Hottempered 27/9, 2024 at 13:37

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Note

Recommended topics

Hot tags