Use Group by with mutate, case_when, any() and all() function in R
Asked Answered
M

2

5

I have a status_df with id and status at each stage:

id stage status
15 1 Pending
15 2 Not Sent
16 1 Approved
16 2 Rejected
16 3 Not Sent
16 4 Not Sent
20 1 Approved
20 2 Approved
20 3 Approved

I am trying to do a group_by ID and apply the following logic:

  • if any stage for an ID has 'Pending' status, final_status column is 'Pending'
  • if any stage for an ID has 'Rejected' status, final_status column is 'Rejected'
  • if all stages for an ID are approved, final_status column is 'Approved'

I am trying this (not working):

final_status_df = status_df %>% select(id, status) %>% group_by(id) %>%
mutate(final_status = case_when(any(status)=="Pending" ~ "Pending",
any(status)=="Rejected" ~ "Rejected", 
all(status)=="Approved" ~ "Approved"))

Expected output (final_status_df):

id final_status
15 Pending
16 Rejected
20 Approved
Meloniemelony answered 3/2, 2021 at 23:59 Comment(0)
C
7

You were in the right direction with your attempt however, you closed any/all brackets early before comparison (==). Also since you only want 1 row for every id you can use summarise instead of mutate which will also avoid the use of select.

library(dplyr)

status_df %>% 
  group_by(id) %>%
  summarise(final_status = case_when(any(status == "Pending") ~ "Pending",
                                     any(status == "Rejected") ~ "Rejected", 
                                     all(status == "Approved") ~ "Approved"))

#    id final_status
#* <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    
Choanocyte answered 4/2, 2021 at 3:46 Comment(1)
Thank you. I was getting the error that any() takes logical relation, but I couldn't understand it was because I was placing the brackets wrong.Meloniemelony
L
2

We can use summariseinstead of mutate (as mutate returns the output column with the same length as the input column and it is used to create/modify a column instead of summarising).

Also, an easier option is to convert to factor with levels specified in the custom order, drop the unused levels (droplevels) and select the first levels after grouping by 'id'

library(dplyr)
status_df %>%
    group_by(id) %>%
    summarise(final_status = first(levels(droplevels(factor(status, 
          levels = c("Pending", "Rejected", "Approved"))))), .groups = 'drop')

-output

# A tibble: 3 x 2
#     id final_status
#  <int> <chr>       
#1    15 Pending     
#2    16 Rejected    
#3    20 Approved    

In the OP's code, any(status) returns NA, instead it should be wrapped on a logical vector i.e. any(status == "Pending"). Also, as mentioned above, it would be summarise instead of mutate

data

status_df <- structure(list(id = c(15L, 15L, 16L, 16L, 16L, 16L, 20L, 20L, 
20L), stage = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), status = c("Pending", 
"Not Sent", "Approved", "Rejected", "Not Sent", "Not Sent", "Approved", 
"Approved", "Approved")), class = "data.frame", row.names = c(NA, 
-9L))
Lilley answered 4/2, 2021 at 0:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.