Count NAs between first and last occured numbers

Asked 12/2, 2020 at 13:28 Answered 13/2, 2020 at 3:21

Solved r dataframe dplyr na missing-data

Here is my toy dataset

df <- tribble(
  ~x, ~y, ~z,
  7,   NA, 4,
  8,   2,  NA,
  NA,  NA, NA,
  NA,  4,  6)

I want to get a dataframe with a number of NAs for each variable only between the first and the last occurrence of numbers in each column and number of NAs between the first occurred number and last row. So, for this example, the desired solution is

desired_df <- tribble(~vars, ~na_count_between_1st_last_num, ~na_count_between_1st_num_last_row,
                       "x",     0,                              2,
                       "y",     1,                              1,
                       "z",     2,                              2)

How can I get the desired output?

Mada answered 12/2, 2020 at 13:28 Comment(0)

na.trim trims NAs off both ends or just the left or right end if we specify sides="left" or sides="right" so:

library(dplyr)
library(tibble)
library(tidyr)
library(zoo)

df %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  summarize(na1 = sum(is.na(na.trim(value))), 
            na2 = sum(is.na(na.trim(value, "left")))) %>%
  ungroup

giving:

# A tibble: 3 x 3
  name    na1   na2
  <chr> <int> <int>
1 x         0     2
2 y         1     1
3 z         2     2

Hydrothorax answered 12/2, 2020 at 14:40 Comment(0)

Here is an idea via base R,

f1 <- function(x) {i1 <- which(!is.na(x)); head(i1, 1):tail(i1, 1) }
f2 <- function(x) {i1 <- which(!is.na(x)); head(i1, 1):length(x) }

merge(stack(sapply(df, function(i) sum(is.na(i[f1(i)])))), 
      stack(sapply(df, function(i) sum(is.na(i[f2(i)])))), by = 'ind')

#  ind values.x values.y
#1   x        0        2
#2   y        1        1
#3   z        2        2

Caulk answered 12/2, 2020 at 13:57 Comment(0)

na.trim trims NAs off both ends or just the left or right end if we specify sides="left" or sides="right" so:

library(dplyr)
library(tibble)
library(tidyr)
library(zoo)

df %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  summarize(na1 = sum(is.na(na.trim(value))), 
            na2 = sum(is.na(na.trim(value, "left")))) %>%
  ungroup

giving:

# A tibble: 3 x 3
  name    na1   na2
  <chr> <int> <int>
1 x         0     2
2 y         1     1
3 z         2     2

Hydrothorax answered 12/2, 2020 at 14:40 Comment(0)

Here is one possibility using two functions:

fun1 <- function(x) { #count NA between first and last non NA
  idx1 <- cumsum(!is.na(x)) > 0 #identify leading NA
  idx2 <- rev(cumsum(!is.na(rev(x))) > 0) #identify trailing NA
  sum(is.na(x[idx1 & idx2]))
}


fun2 <- function(x) {#count NA between first non-NA and last element
  idx1 <- cumsum(!is.na(x)) > 0 #identify leading NA
  sum(is.na(x[idx1]))
}

Afterwards you just summarise your data.frame and reshape it:

df %>% summarise_all(list(m1 = ~fun1(.), m2 = ~fun2(.))) %>%
  pivot_longer(cols = everything(), names_pattern = "^(.)_(.*)$", names_to = c("vars", "a"),
               values_to = "x") %>%
  spread(a, x)

# A tibble: 3 x 3
  vars     m1    m2
  <chr> <int> <int>
1 x         0     2
2 y         1     1
3 z         2     2

Dikmen answered 12/2, 2020 at 13:50 Comment(2)

On my real data, I get the output as # A tibble: 1 x 2 vars <NA> <chr> <int> 1 NA 0 , not sure why. Is it because of the names_pattern = "^(.)_(.*)$" – Mada 12/2, 2020 at 14:55

probabily because of names_pattern. Best check ?pivot_longer to see how to use names_pattern. Maybe you can simply use names_pattern = "^(.*)_(.*)$". – Dikmen 12/2, 2020 at 15:25

Here is another option using data.table::nafill:

library(data.table)
natrail <- colSums(is.na(as.data.table(nafill(df, "nocb"))))
nastart <- colSums(is.na(as.data.table(nafill(df, "locf"))))    
n1last <- nrow(df) - colSums(!is.na(df)) - nastart
n1num <- n1last - natrail

cbind(na_count_between_1st_last_num=n1num, na_count_between_1st_num_last_row=n1last)

output:

  na_count_between_1st_last_num na_count_between_1st_num_last_row
x                             0                                 2
y                             1                                 1
z                             2                                 2
a                             1                                 2
b                             0                                 0
d                             0                                 1

data:

df <- data.frame(x=c(7,8,NA,NA), #0 2
    y=c(NA, 2, NA, 4),           #1 1
    z=c(4, NA, NA, 6),           #2 2
    a=c(1, NA, 1, NA),           #1 2
    b=c(NA, NA, 1, 1),           #0 0
    d=c(NA, 1, 1, NA))           #0 1

Colubrine answered 13/2, 2020 at 3:21 Comment(0)

Recommended topics

Hot tags