dynamically extract elements from list column
Asked Answered
N

3

5

I have the following data:

df <- structure(list(id = c("1358792", "1358792", "333482", "333482", "747475", "747475"),
                     x = c("123", "123", "456", "456", NA, NA),
                     all_x = list("123", "123",
                                  c("456", "789"),
                                  c("456", "789"),
                                  list(),
                                  list())),
                row.names = c(NA, -6L),
                class = "data.frame")
    
       id    x    all_x
1 1358792  123      123
2 1358792  123      123
3  333482  456 456, 789
4  333482  456 456, 789
5  747475 <NA>     NULL
6  747475 <NA>     NULL

The all_x column is a list with either an EMPTY/NULL value, a single character or a character vector.

I want to create a new column (tidyverse style) with the following logic: when the all_x column has one or no value, just take the value from x. If it has two values (i.e. is a character vector), we want to group by id and take the element that corresponds to the row number, i.e. for the first id value, take the first element of the character vector, for the second id element, take the second character value and so on.

Desired output would be an additional character column with the respective values, i.e.

       id    x    all_x   x2
1 1358792  123      123  123
2 1358792  123      123  123
3  333482  456 456, 789  456
4  333482  456 456, 789  789
5  747475 <NA>     NULL <NA>
6  747475 <NA>     NULL <NA>

I have tried tons of variants with if_else, ifelse and unlisting and indexing, but still always get errors due to the mixed structure of the all_x column.

Here's the closest I got:

library(tidyverse)
df |>
  mutate(x2 = if_else(lengths(all_x) > 1, all_x[[1]][row_number()], x), .by = id)

However, obviously, I'm not successful.

Namnama answered 18/10, 2024 at 8:52 Comment(5)
Are the duplicate rows intentional?Convection
Yes, they are..Namnama
Why is it desired to keep all_x as a list column? Wouldn't it be better to make it character by doing lapply(all_x, toString)? In other words, I would appreciate a reference to why and when keeping a list column is considered to be good practice?Pulque
Technically, I'm only interested in the charater information, which is shown in my desired output in the column x2. However, running lapply(all_x, toString) still keeps the column as a list column and working with the code still leads to these type-inconsistency errors.Namnama
Ok. I recommend to change the structure then. Used just a shortcut, good practice might be df0$all_x = vapply(df0$all_x, toString, character(1L)).Pulque
K
2

I think you should use ifelse (rather than if_else), with the minimal effort to make it fly

> df |>
+     mutate(x2 = ifelse(lengths(all_x) > 1, all_x[[1]][row_number()], x), .by = id)
       id    x    all_x   x2
1 1358792  123      123  123
2 1358792  123      123  123
3  333482  456 456, 789  456
4  333482  456 456, 789  789
5  747475 <NA>     NULL <NA>
6  747475 <NA>     NULL <NA>

Note: the difference between if_else and ifelse

Kibitz answered 18/10, 2024 at 9:41 Comment(2)
Great, this works. It seems that I missed this obvious version when permutating through a lot of options. one question, though, when using if_else, I somehow understood the comments regarding type-stability, and tried to fix these by e.g. wrapping the all_x in an unlist command or the other way around, i.e. wrapping the false condition into a list command. However, none of these worked, but I'd be still interested in how to create such type stability to be able to use if_else.Namnama
@Namnama to be honest, I rarely use if_else and sorry that I have no idea about the workaround on top of if_elseKibitz
F
2

Here is a function that will either return x when all_x has one value or none, or otherwise take the element from all_x that corresponds to the row number:

get_x2 <- function(x, all_x, row_num) {
    if (length(all_x) <= 1) {
        return(x)
    }
    all_x[[row_num]]
}

Then it's just a case of creating the row_number() by id and using Map():

df |>
    mutate(row_num = row_number(), .by = id) |>
    mutate(
        x2 = Map(get_x2, x, all_x, row_num)
    )
#        id    x    all_x row_num  x2
# 1 1358792  123      123       1 123
# 2 1358792  123      123       2 123
# 3  333482  456 456, 789       1 456
# 4  333482  456 456, 789       2 789
# 5  747475 <NA>     NULL       1  NA
# 6  747475 <NA>     NULL       2  NA
Flemish answered 18/10, 2024 at 9:19 Comment(2)
Thanks. Could you please elaborate a bit on how this works? I'm confused about the usage of length instead of lengths in the function call.Namnama
@Namnama it's basically iterating over the all_x list and applying the function to each value. If your data is really large this will be slow - but I assume if speed rather than expressiveness was the primary concern then you wouldn't be looking for a tidyverse solution.Flemish
K
2

I think you should use ifelse (rather than if_else), with the minimal effort to make it fly

> df |>
+     mutate(x2 = ifelse(lengths(all_x) > 1, all_x[[1]][row_number()], x), .by = id)
       id    x    all_x   x2
1 1358792  123      123  123
2 1358792  123      123  123
3  333482  456 456, 789  456
4  333482  456 456, 789  789
5  747475 <NA>     NULL <NA>
6  747475 <NA>     NULL <NA>

Note: the difference between if_else and ifelse

Kibitz answered 18/10, 2024 at 9:41 Comment(2)
Great, this works. It seems that I missed this obvious version when permutating through a lot of options. one question, though, when using if_else, I somehow understood the comments regarding type-stability, and tried to fix these by e.g. wrapping the all_x in an unlist command or the other way around, i.e. wrapping the false condition into a list command. However, none of these worked, but I'd be still interested in how to create such type stability to be able to use if_else.Namnama
@Namnama to be honest, I rarely use if_else and sorry that I have no idea about the workaround on top of if_elseKibitz
T
2

Working with NULL is always a bit clunky and mostly only works in a list context. You can replace NULL with NA, this way nothing can complain about NULL in a non-list context

df %>% 
  mutate(x2 = replace(all_x, lengths(all_x) == 0, NA), 
         x2 = unlist(x2)[row_number()], .by = id)
       id    x    all_x   x2
1 1358792  123      123  123
2 1358792  123      123  123
3  333482  456 456, 789  456
4  333482  456 456, 789  789
5  747475 <NA>     NULL <NA>
6  747475 <NA>     NULL <NA>
Tristich answered 18/10, 2024 at 18:11 Comment(1)
I was about to flag this one as the accepted answer, because I think I understand it a bit better than the misteries around if_else/ifelse. However, with my huge data set, it takes somewhat longer than other approaches (~2-3 times slower). Still a great solution!Namnama

© 2022 - 2025 — McMap. All rights reserved.