Mutate the column name from the corresponding selected value among many columns

Asked 20/9, 2024 at 8:6 Answered 21/9, 2024 at 12:45

Let's consider markers with their coefficient of variation (cv) and three reference cv (rcv):

Initial data:

  marker    cv  rcv1  rcv2  rcv3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 AAA        7    10     8     5
2 BBB        4     5     3     1
3 CCC       11    20    15    12
4 DDD        8     7     5     2

I would like to mutate three new variables:

rcv_value: the closest rcv value greater than the cv
rcv_name: the column name of that rcv_value
cv_conclusion:
- ok if the cv is lower than one or the other of the rcvs
- ko if the cv is higher than the highest rcv

Desired output:

  marker    cv  rcv1  rcv2  rcv3 rcv_value rcv_name cv_conclusion
  <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl> <chr>    <chr>        
1 AAA        7    10     8     5         8 rcv2     ok           
2 BBB        4     5     3     1         5 rcv1     ok           
3 CCC       11    20    15    12        12 rcv3     ok           
4 DDD        8     7     5     2         7 rcv1     ko

NB: my real data has more than 100 markers and about 10 different rcv.
Where I fail is getting the rcv_name from the corresponding rcv_value (using mutate and case_when).

Thanks for help.

Data:

dat0 <-
structure(list(marker = c("AAA", "BBB", "CCC", "DDD"), cv = c(7, 
4, 11, 8), rcv1 = c(10, 5, 20, 7), rcv2 = c(8, 3, 15, 5), rcv3 = c(5, 
1, 12, 2)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))

Selenite answered 20/9, 2024 at 8:6 Comment(0)

You can use this tidyverse manner that combines dplyr::mutate and purrr::pmap:

library(dplyr)

dat0 %>%
  mutate(
    purrr::pmap_dfr(pick(cv, rcv1:rcv3), ~ {
      x <- c(...)[-1]
      tibble::enframe(x[order(x < ..1, abs(x - ..1), -x)][1],
                      name = "rcv_name", value = "rcv_value")
    }), cv_conclusion = ifelse(rcv_value >= cv, "ok", "ko")
  )

or the following dplyr-only alternative:

dat0 %>%
  rowwise() %>% 
  mutate(rcv_value = {
    x <- c_across(rcv1:rcv3)
    x[order(x < cv, abs(x - cv), -x)][1]
  }) %>%
  ungroup() %>%
  mutate(
    rcv_name = do.call(coalesce, across(rcv1:rcv3, ~ ifelse(.x == rcv_value, cur_column(), NA))),
    cv_conclusion = ifelse(rcv_value >= cv, "ok", "ko")
  )

Output

# # A tibble: 4 × 8
#   marker    cv  rcv1  rcv2  rcv3 rcv_value rcv_name cv_conclusion
#   <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl> <chr>    <chr>        
# 1 AAA        7    10     8     5         8 rcv2     ok           
# 2 BBB        4     5     3     1         5 rcv1     ok           
# 3 CCC       11    20    15    12        12 rcv3     ok           
# 4 DDD        8     7     5     2         7 rcv1     ko

Amplitude answered 20/9, 2024 at 8:45 Comment(0)

This is a case where reshaping to long makes it much more straightforward. First use tidyr::pivot_longer(), then dplyr::mutate(.by = marker) to perform these operations by group, and then tidyr::pivot_wider() to return the data to the original shape.

library(dplyr)
library(tidyr)

dat0 |>
    pivot_longer(-c(marker, cv)) |>
    mutate(
        rcv_value = min(value[value > cv]),
        # rcv_value will be Inf if no values > cv
        cv_conclusion = if_else(is.infinite(rcv_value), "ko", "ok"),
        rcv_value = if_else(is.infinite(rcv_value), max(value), rcv_value),
        rcv_name = name[rcv_value == value],
        .by = marker
    ) |>
    # reshape back to wide
    pivot_wider(id_cols = c(marker, cv, rcv_value, rcv_name, cv_conclusion)) |>
    # reorder columns as desired
    relocate(marker, cv, rcv1:rcv3, rcv_value:cv_conclusion)

# # A tibble: 4 × 8
#   marker    cv  rcv1  rcv2  rcv3 rcv_value rcv_name cv_conclusion
#   <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl> <chr>    <chr>        
# 1 AAA        7    10     8     5         8 rcv2     ok           
# 2 BBB        4     5     3     1         5 rcv1     ok           
# 3 CCC       11    20    15    12        12 rcv3     ok           
# 4 DDD        8     7     5     2         7 rcv1     ko

A note on ties

If you can have reference cv ties (i.e. repeated values in the same row in rcv1:rcv3 columns) you'll have to specify which you want rcv_name to include. If you just want the first value you could change that line to rcv_name = name[rcv_value == value][1]. Alternatively, if you are doing further processing you could create a list column to store all the values, e.g. list(name[rcv_value == value]) (though I might prefer to just keep the table in long form). Alternatively, if this table is an output, you might want to paste them together e.g. rcv_name = paste(name[rcv_value == value], collapse = ",") to create results in the column such as "rcv2,rcv3".

Vicky answered 20/9, 2024 at 8:23 Comment(3)

That warning message is annoying. I sometimes use the min_ function from the hablar package to circumvent that. Or write my own custom function. – Utter 21/9, 2024 at 2:47

@Utter thanks - I didn't know about that function. I can live with a warning though in this context. – Vicky 21/9, 2024 at 6:17

Thanks for all these excellent solutions. A preference for the one using pivot_longer and pivot_wider because for my real data, it offers more flexibility for other operations and manipulations beyond the present case. – Selenite 21/9, 2024 at 9:23

You can try the following base R option, using max.col and colMeans

rcvcols <- startsWith(names(dat0), "rcv")
u <- abs((d <- dat0[rcvcols] - dat0$cv) / (d > 0))
idx <- max.col(-u, "first")
dat0$rcv_value <- as.matrix(dat0[rcvcols])[cbind(1:nrow(u), idx)]
dat0$rcv_name <- names(u)[idx]
dat0$cv_conclusion <- c("ko", "ok")[1 + (colMeans(is.infinite(t(u))) < 1)]

which gives

> dat0
# A tibble: 4 × 8
  marker    cv  rcv1  rcv2  rcv3 rcv_value rcv_name cv_conclusion
  <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl> <chr>    <chr>
1 AAA        7    10     8     5         8 rcv2     ok
2 BBB        4     5     3     1         5 rcv1     ok
3 CCC       11    20    15    12        12 rcv3     ok
4 DDD        8     7     5     2         7 rcv1     ko

Tardiff answered 21/9, 2024 at 12:45 Comment(2)

Wow! Super concise, but what about the ‘rcv_value’? – Utter 21/9, 2024 at 14:57

@Utter aha, sorry that I overlooked that term. Thanks for your reminder and please see my update. – Tardiff 21/9, 2024 at 20:22

With a user-defined function:

RCV <- function(dat) {
  library(Rfast)
  dat <- data.frame(dat)
  rcv <- grep('rcv', names(dat))
  M <- as.matrix(dat[,rcv] - dat$cv)
  M[M<0] <- NA
  rmins <- rowMins(M)
  dat$rcv_value <- dat[,rcv][cbind(1:nrow(dat), rmins)]
  dat$rcv_name <- names(dat[,rcv])[rmins]
  dat$cv_conclusion <- ifelse(dat0$cv < dat$rcv_value, "ok", "ko")
  dat
}

RCV(dat0)
___
  marker cv rcv1 rcv2 rcv3 rcv_value rcv_name cv_conclusion
1    AAA  7   10    8    5         8     rcv2            ok
2    BBB  4    5    3    1         5     rcv1            ok
3    CCC 11   20   15   12        12     rcv3            ok
4    DDD  8    7    5    2         7     rcv1            ko

Utter answered 20/9, 2024 at 9:9 Comment(2)

I like the use of Rfast but I think you should use library() rather than require() here (and basically everywhere) - see What is the difference between require() and library()?. Also the tidyverse style guide suggests If your script uses add-on packages, load them all at once at the very beginning of the file. Of course you don't need to follow it but I think it's sensible. – Vicky 20/9, 2024 at 9:52

Thanks @SamR! I agree entirely with what you said. Updated. – Utter 21/9, 2024 at 2:39

Output

A note on ties

Recommended topics

Hot tags