Selecting values from specific columns and skipping NA values in R

Asked 24/8, 2021 at 19:43 Answered 25/8, 2021 at 13:51

I am working with the cancer registry data. In the following data example (ex_data), variables id and diagnosis_yr stand for ID and year at cancer diagnosis receptively. Columns x_2005 to x_2010 and y_2005 to y_2010 respectively stand for x and y status for each year (2005 to 2010).In my actual working data, I have many columns for many years (2005-2020). I would like to extract x and y values from the earliest available year, latest available year, and at the diagnosis year (ie. x_earliest, y_latest,x_at_diagnosis,y_at_diagnosis variables in "wanted" ) by excluding NAs . For id 1, for example , I would like to extract x values from the earliest year and y values from the latest year by skipping NAs. For x and y values at the diagnosis year, if there is NAs at the diagnosis year, I would like to skip NAs and extract the available data from the preceding year. How can I implement to get wanted variables in R?

library(tidyverse)

#example data
ex_data <- tribble(
~id,~diagnosis_yr,~x_2005,~x_2006,~x_2007,~x_2008,~x_2009,~x_2010,~y_2005,~y_2006,~y_2007,~y_2008,~y_2009,~y_2010,
1,  2007,   NA, NA, 1,  2,  2,  3,  "a",    "b",    "c",    "d",    "e",    NA, 
2,  2008,   1,  3,  1,  NA, 1,  2,   NA,    "b",    "b",    "e",    "d", "d",
3,  2010,   NA, 2,  2,  2,  3,  NA, "a",    "b",    "c",     NA,     NA,    NA,
4,  2009, 1,    3,  1,  NA, 1,  2,   NA,     NA,     NA,     NA,     NA,    NA,
5,  2005, NA,   1,  1,  2,  2,  3,  "a",    "b",    "c",    "d",    "e",    "e"
)

#wanted variables
wanted <- tribble(
  ~id,~diagnosis_yr,~x_earliest,~y_latest,~x_at_diagnosis,~y_at_diagnosis,
  1,    2007,   1,  "e",    1,  "c",
  2,    2008,   1,  "d",    1,  "e",
  3,    2010,   2,  "c",    3,  "c",
  4,  2009, 1,   NA,  1,  NA,
  5,  2005, 1,  "e", NA,  "a"
)

Colporteur answered 24/8, 2021 at 19:43 Comment(0)

With the help of suggested codes & strategies from @Martin and @TarJae, I would like to share the following codes (combinations of Martin's and TarJae's suggested codes) to address my question (edited version).

library (zoo)
library(dplyr)
library(tidyverse) 

ex_data %>% 
  pivot_longer(-c(id, diagnosis_yr), 
               names_to = c(".value", "year"),
               names_pattern = "(.*)_(\\d+)") %>% 
  group_by(id) %>% 
  mutate(x_earliest     = first(na.locf(x,fromLast=T,na.rm = F)),
         x_at_diagnosis = last(na.locf(x[diagnosis_yr >= year],na.rm = F)), #na.rm=F is to keep as it is if there is no replacement 
         y_latest       = last(na.locf(y,fromLast=F, na.rm =F)), 
         y_at_diagnosis = last(na.locf(y[diagnosis_yr >= year],na.rm=F))) %>% 
  dplyr::select(id, diagnosis_yr, x_earliest, y_latest, x_at_diagnosis, y_at_diagnosis) %>% 
  distinct() %>% 
  ungroup()

Out put

id       diagnosis_yr x_earliest y_latest  x_at_diagnosis  y_at_diagnosis
  <dbl>        <dbl>      <dbl>  <chr>             <dbl> <chr>         
     1         2007          1    e                    1   c             
     2         2008          1    d                    1   e             
     3         2010          2    c                    3   c             
     4         2009          1    NA                   1   NA            
     5         2005          1    e                   NA   a

Colporteur answered 25/8, 2021 at 13:51 Comment(0)

I'm not completely sure, if this is correct:

library(dplyr)
library(tidyr)

ex_data %>% 
  pivot_longer(-c(id, diagnosis_yr), 
               names_to = c(".value", "year"),
               names_pattern = "(.*)_(\\d+)") %>% 
  group_by(id) %>% 
  mutate(x_earliest     = first(na.omit(x)),
         x_at_diagnosis = last(na.omit(x[diagnosis_yr >= year])),
         y_latest       = last(na.omit(y)),
         y_at_diagnosis = last(na.omit(y[diagnosis_yr >= year]))) %>% 
  select(id, diagnosis_yr, x_earliest, y_latest, x_at_diagnosis, y_at_diagnosis) %>% 
  distinct() %>% 
  ungroup()

This returns

# A tibble: 3 x 6
     id diagnosis_yr x_earliest y_latest x_at_diagnosis y_at_diagnosis
  <dbl>        <dbl>      <dbl> <chr>             <dbl> <chr>         
1     1         2007          1 e                     1 c             
2     2         2008          1 d                     1 e             
3     3         2010          2 c                     3 c

Ludovico answered 24/8, 2021 at 22:31 Comment(8)

Took me some time to understand the actual problem. And I'm still not sure if this approach is very stable on a larger dataset... – Ludovico 24/8, 2021 at 22:52

That is exactly what I felt. I hope the final version will do it also for larger dataset. But in the end it is really joyful to have this kind of tasks. – Woodward 24/8, 2021 at 22:54

It's like solving riddles. :-) – Ludovico 24/8, 2021 at 22:58

@Woodward If you don't know it, take a look at Advent Of Code. Besides SO it's a great source of this kind of fun. Winter is coming. – Ludovico 24/8, 2021 at 23:2

Thank you very much for sharing this. I will explore it! – Woodward 24/8, 2021 at 23:5

Thanks @MartinGal and TarJae for your kind help. I tried the suggested codes in my actual data. It ran well for the rows like id 1,2 and 3 but I encountered error for the rows like id 4 and 5 (I have added new rows, id 4 and 5, in ex_data). How can I keep NAs as NAs if there is no replacement? – Colporteur 25/8, 2021 at 12:12

When I run my code with your new ex_data I get your wanted result... – Ludovico 25/8, 2021 at 12:25

Now, I have got a solution after learning your codes and TarJae's codes which are very helpful and educational. – Colporteur 25/8, 2021 at 13:59

Strategy:

Split in x and y dataframes and join them at the end, the logic for x and y are the same:
with coalesce in the correct order you can get x_earliest and y_latest.
fill NA with values before: this is a code by Anoushiravan Fill missing values with previous values by row using dplyr this is using library(zoo)
mutate a new column with values depending on year (code from akrun, thanks master).

library(zoo)
library(tidyverse)

x <- ex_data %>% 
    select(id, diagnosis_yr, starts_with("x_")) %>% 
    mutate(x_earliest= coalesce(x_2005, x_2006, x_2007, x_2008, x_2009, x_2010 )) %>% 
    mutate(pmap_df(., ~ na.locf(c(...)[-1]))) %>% # fill NA with value before
    rowwise() %>% 
    mutate(x_at_diagnosis = get(str_c('x_', diagnosis_yr)))


y <- ex_data %>% 
    select(id, diagnosis_yr, starts_with("y_")) %>% 
    mutate(y_latest = coalesce(y_2010, y_2009, y_2008, y_2007, y_2006, y_2005)) %>% 
    mutate(pmap_df(., ~ na.locf(c(...)[-1]))) %>% 
    rowwise() %>% 
    mutate(y_at_diagnosis = get(str_c('y_', diagnosis_yr))) %>% 
    type.convert(as.is=TRUE)

left_join(x, y, by=c("id", "diagnosis_yr")) %>% 
    select(id, diagnosis_yr, x_earliest, y_latest, x_at_diagnosis, y_at_diagnosis)

output:

 id diagnosis_yr x_earliest y_latest x_at_diagnosis y_at_diagnosis
  <dbl>        <dbl>      <dbl> <chr>             <dbl> <chr>         
1     1         2007          1 e                     1 c             
2     2         2008          1 d                     1 e             
3     3         2010          2 c                     3 c

Woodward answered 24/8, 2021 at 22:50 Comment(2)

Oh Man I'm speechless. Don't know what to say :D – Geraint 25/8, 2021 at 7:29

Thanks to your codes and Martin's codes which are very helpful and educational, I have leant a solution for my question (edited version) – Colporteur 25/8, 2021 at 14:2

library (zoo)
library(dplyr)
library(tidyverse) 

ex_data %>% 
  pivot_longer(-c(id, diagnosis_yr), 
               names_to = c(".value", "year"),
               names_pattern = "(.*)_(\\d+)") %>% 
  group_by(id) %>% 
  mutate(x_earliest     = first(na.locf(x,fromLast=T,na.rm = F)),
         x_at_diagnosis = last(na.locf(x[diagnosis_yr >= year],na.rm = F)), #na.rm=F is to keep as it is if there is no replacement 
         y_latest       = last(na.locf(y,fromLast=F, na.rm =F)), 
         y_at_diagnosis = last(na.locf(y[diagnosis_yr >= year],na.rm=F))) %>% 
  dplyr::select(id, diagnosis_yr, x_earliest, y_latest, x_at_diagnosis, y_at_diagnosis) %>% 
  distinct() %>% 
  ungroup()

Out put

id       diagnosis_yr x_earliest y_latest  x_at_diagnosis  y_at_diagnosis
  <dbl>        <dbl>      <dbl>  <chr>             <dbl> <chr>         
     1         2007          1    e                    1   c             
     2         2008          1    d                    1   e             
     3         2010          2    c                    3   c             
     4         2009          1    NA                   1   NA            
     5         2005          1    e                   NA   a

Colporteur answered 25/8, 2021 at 13:51 Comment(0)

Recommended topics

Hot tags