Select values row-wise based on rank among dates
Asked Answered
O

6

9

Let's say I have a data frame with several rows like the following:

df <- data.frame(a = c(NA,20,NA),
                 date1 = c("2016-03-01", "2016-02-01", "2016-02-01"),
                 b = c(50,NA, NA),
                 date2 = c("2016-02-01", "2016-03-01", "2016-03-01"),
                 c = c(10,10, 10),
                 date3 = c("2016-01-01","2016-01-01", "2016-01-01"))

For each row, I want to get the latest value which is not a NA between a, b, and c according to the dates (so I respectively look at date1, date2, or date3 and pick the most recent one).

Basically, date1 gives the date corresponding to the value a, date2 gives the date corresponding to the value b, date3 gives the date corresponding to the value c.

If date1 > date2 & date1 > date3, I will want to take the value a However, if the value a is NA (which is the case in my example), I will compare date2 and date3. In my example, date2 > date3 , and since value b is not NA but 50, I will take 50 as my final result.

Now I want to do this for all the rows in my dataframe

Since I am using dplyr, I tried to use the case_when function by using the rank function (in my example, I look a the first ranked date, and then look at the linked value. If it is a NA, I look at the 2nd best ranked, etc...)

However, I can't just put, as I'd like to do, :

df <- df %>%
        mutate(result = case_when(is.na(a) & is.na(b) & is.na(c) ~ NA_integer_,
                                  rev(rank(date1, date2, date3))[1] == 3 & !is.na(a) ~ a,
                                  rev(rank(date1, date2, date3))[2] == 3 & !is.na(b) ~ b,
                                  rev(rank(date1, date2, date3))[3] == 3 & !is.na(a) ~ c,
                                  rev(rank(date1, date2, date3))[1] == 2 & !is.na(a) ~ a,
                                  rev(rank(date1, date2, date3))[2] == 2 & !is.na(b) ~ b,
                                  rev(rank(date1, date2, date3))[3] == 2 & !is.na(a) ~ c,
                                  rev(rank(date1, date2, date3))[1] == 1 & !is.na(a) ~ a,
                                  rev(rank(date1, date2, date3))[2] == 1 & !is.na(b) ~ b,
                                  rev(rank(date1, date2, date3))[3] == 1 & !is.na(a) ~ c))

Because the rank function needs a unique vector as argument (but I can't put c(date1, date2, date3) neither because it would give me the whole order of this vector and not the rank for each row)

In my example the result I would like to have would be

res

a    date1         b      date2       c    date3       result
NA   2016-03-01    50     2016-02-01  10   2016-01-01  50
20   2016-02-01    NA     2016-03-01  10   2016-01-01  20
NA   2016-02-01    NA     2016-03-01  10   2016-01-01  10

Does anyone have an idea or even an entirely different approach to this problem ?

Obliterate answered 28/8, 2017 at 12:52 Comment(0)
M
4

I suggest converting to long-format and computing the relevant values. If you want, you can then add the results to your original data.frame. Here's how you could do that using data.table:

library(data.table)
setDT(df)                     # convert to data.table object
df[, row := .I]               # add a row-id
dflong <- melt(df, id = "row", measure = patterns("^date", "^(a|b|c)"),
               na.rm = TRUE) # convert to long format
setorder(dflong, value1)      # reorder by date value
dflong <- unique(dflong, by = "row", fromLast = TRUE) # get the latest dates
df[dflong, result := i.value2, on = "row"]  # add result to original data

df
#    a      date1  b      date2  c      date3 row result
#1: NA 2016-03-01 50 2016-02-01 10 2016-01-01   1     50
#2: 20 2016-02-01 NA 2016-03-01 10 2016-01-01   2     20
#3: NA 2016-02-01 NA 2016-03-01 10 2016-01-01   3     10
Maeve answered 28/8, 2017 at 13:41 Comment(1)
Thanks ! I'm impressed by the speed of the algorithm thanks to the use of data.table objectObliterate
S
3

This ought to handle it. First we put the data in tidy form (1 row for each Date, Value, along with a row_num to identify which example the tidy row belongs to). Then we filter out NAs, group_by row_num, order by Date descending, and take the first row.

df %>%
  mutate(row_num = row_number()) %>%
  unite(a, a, date1) %>%
  unite(b, b, date2) %>%
  unite(c, c, date3) %>%
  gather(key, value, -row_num) %>%
  select(-key) %>%
  separate(value, into=c("Value", "Date"), sep = "_") %>%
  mutate(Date = as.Date(Date)) %>%
  filter(Value != "NA") %>%
  group_by(row_num) %>%
  top_n(1, Date) %>%
  ungroup()

Results

Synergism answered 28/8, 2017 at 13:45 Comment(0)
N
2

Here is one way to do it...

df$result <- apply(df, 1, function(x){
  dates <- as.Date(x[seq(2, length(x), 2)])
  values <- x[seq(1,length(x),2)]
  return(values[!is.na(values)][which.max(dates[!is.na(values)])])
})

df
   a      date1  b      date2  c      date3 result
1 NA 2016-03-01 50 2016-02-01 10 2016-01-01     50
2 20 2016-02-01 NA 2016-03-01 10 2016-01-01     20
3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01     10
Nureyev answered 28/8, 2017 at 13:37 Comment(0)
M
1

Here one more approach:

df$row <- 1:nrow(df)

gather(df, key, date_val, date1, date2, date3, -row) %>% 
   select(-key) %>% 
   gather(key, val, a,b,c) %>% 
   filter(!is.na(val)) %>% 
   group_by(row) %>% 
   mutate(max_date = max(date_val)) %>% 
   filter(date_val == max_date) %>% summarise(result = max(val)) %>% 
   left_join(df, by="row") %>% select(-row)

# A tibble: 3 × 7
  result     a      date1     b      date2     c      date3
   <dbl> <dbl>     <fctr> <dbl>     <fctr> <dbl>     <fctr>
1     50    NA 2016-03-01    50 2016-02-01    10 2016-01-01
2     20    20 2016-02-01    NA 2016-03-01    10 2016-01-01
3     10    NA 2016-02-01    NA 2016-03-01    10 2016-01-01
Mellifluent answered 28/8, 2017 at 13:48 Comment(0)
C
1

Another base alternative:

df$id <- 1:nrow(df)
d2 <- reshape(df, varying = list(seq(1, by = 2, len = (ncol(df) - 1)/2),
                                 seq(2, by = 2, len = (ncol(df) - 1)/2)),
              direction = "long")

d2 <- with(d2, d2[order(-id, date1, decreasing = TRUE), ])

cbind(df, res = tapply(d2$a[!is.na(d2$a)], d2$id[!is.na(d2$a)], `[`, 1)) 
#    a      date1  b      date2  c      date3 id res
# 1 NA 2016-03-01 50 2016-02-01 10 2016-01-01  1  50
# 2 20 2016-02-01 NA 2016-03-01 10 2016-01-01  2  20
# 3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01  3  10
Constellate answered 28/8, 2017 at 14:10 Comment(0)
M
0

The party was over, but I just came across this post and decided to leave the following. The idea was that I rather wanted to create a data frame and do the job.

out <- data.frame(group = 1:nrow(df),
                  date = as.Date(unlist(df[, grep(x = names(df), "[1-9]")]),
                                 "%Y-%m-%d"),
                  result = unlist(df[nchar(names(df)) == 1])) %>%
       filter(complete.cases(.)) %>%
       group_by(group) %>%
       slice(which.max(date)) %>%
       ungroup

cbind(df, result = out$result)

#   a      date1  b      date2  c      date3 result
#1 NA 2016-03-01 50 2016-02-01 10 2016-01-01     50
#2 20 2016-02-01 NA 2016-03-01 10 2016-01-01     20
#3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01     10

If I use data.table, I would do the following, which is based on the answer of docendo.

setDT(df)[, row := .I]

out <- melt(df, id = "row", measure = patterns("^date", "^(a|b|c)"), 
            value.name = c("date", "result"), na.rm = TRUE) [, 
                 date := as.Date(date, "%Y-%m-%d")][,
                     .SD[which.max(date)], by = row][, c("row", "result")]

df[out, on = "row"]

#    a      date1  b      date2  c      date3 row result
#1: 20 2016-02-01 NA 2016-03-01 10 2016-01-01   2     20
#2: NA 2016-03-01 50 2016-02-01 10 2016-01-01   1     50
#3: NA 2016-02-01 NA 2016-03-01 10 2016-01-01   3     10
Mellicent answered 1/9, 2017 at 6:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.