How to row bind dataframes stored in a multi-level nested list, and add an identifier column for each level?

Asked 19/12, 2023 at 18:1 Answered 19/12, 2023 at 19:16

Solved r dataframe dplyr purrr nested-lists

Beginner in data manipulation in R, I struggle with multi-level nested lists.

Question: Is there a way to transform this dat0 3-levels list into the global dat1 dataframe below?

The new fulltext column concatenates the text variable from each tibble.
the new nbsum column adds the nb variable from each tibble.

Note: a purrr-based approach with dplyr functions (mutate...) would be welcome to better understand these tools in this particular context. Other approaches are also welcome!

Thanks for help

Initial data:

dat0 <- list(pdf1 =
               list(page1 =
                      list(tibble1 = tibble(x = c(1,2,3,4), y = c(1,1,1,1), text = c("ha","r","r","y"), nb = c(1,2,3,4)),
                           tibble2 = tibble(x = c(1,2,3,4), y = c(2,2,2,2), text = c("p","ot","t","er"), nb = c(1,2,3,4))),
                    page2 = 
                      list(tibble1 = tibble(x = c(1,2,3), y = c(3,3,3), text = c("her","m","ione"), nb = c(1,2,3)),
                           tibble2 = tibble(x = c(1,2,3), y = c(4,4,4), text = c("gra","ng","er"), nb = c(1,2,3)))),
             pdf2 =
               list(page1 =
                      list(tibble1 = tibble(x = c(1,2), y = c(5,5), text = c("vol","de"), nb = c(1,2)),
                           tibble2 = tibble(x = c(1,2), y = c(6,6), text = c("m","ort"), nb = c(1,2))),
                    page2 =
                      list(tibble1 = tibble(x = c(1,2,3,4,5), y = c(7,7,7,7,7), text = c("a","l","b","u","s"), nb = c(1,2,3,4,5)),
                           tibble2 = tibble(x = c(1,2,3,4,5), y = c(8,8,8,8,8), text = c("du","m","ble","do","re"), nb = c(1,2,3,4,5))),
                    page3 = 
                      list(tibble1 = tibble(x = c(1,2,3,4), y = c(9,9,9,9), text = c("dr","a","g","o"), nb = c(1,2,3,4)),
                           tibble2 = tibble(x = c(1,2,3,4), y = c(10,10,10,10), text = c("ma","lf","o","y"), nb = c(1,2,3,4)))),
             pdf3 =
               list(page1 =
                      list(tibble1 = tibble(x = c(1,2,3,4,5), y = c(11,11,11,11,11), text = c("s","ev","e","ru","s"), nb = c(1,2,3,4,5)),
                           tibble2 = tibble(x = c(1,2,3,4,5), y = c(12,12,12,12,12), text = c("r","o","g","u","e"), nb = c(1,2,3,4,5))),
                    page2 =
                      list(tibble1 = tibble(x = c(1,2,3), y = c(13,13,13), text = c("r","o","n"), nb = c(1,2,3)),
                           tibble2 = tibble(x = c(1,2,3), y = c(14,14,14), text = c("we","as","ley"), nb = c(1,2,3))),
                    page3 =
                      list(tibble1 = tibble(x = c(1,2,3,4,5,6), y = c(15,15,15,15,15,15), text = c("be","l","la","t","ri","x"), nb = c(1,2,3,4,5,6)),
                           tibble2 = tibble(x = c(1,2,3,4,5,6), y = c(16,16,16,16,16,16), text = c("l","est","r","a","ng","e"), nb = c(1,2,3,4,5,6))),
                    page4 = 
                      list(tibble1 = tibble(x = c(1,2), y = c(17,17), text = c("sir","ius"), nb = c(1,2)),
                           tibble2 = tibble(x = c(1,2), y = c(18,18), text = c("bl","ack"), nb = c(1,2)))))

Desired output (constructed laboriously; dput script below):

dat1 <-
structure(list(pdf = c("pdf1", "pdf1", "pdf1", "pdf1", "pdf1", 
"pdf1", "pdf1", "pdf1", "pdf1", "pdf1", "pdf1", "pdf1", "pdf1", 
"pdf1", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", 
"pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", 
"pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3"), page = c("page1", 
"page1", "page1", "page1", "page1", "page1", "page1", "page1", 
"page2", "page2", "page2", "page2", "page2", "page2", "page1", 
"page1", "page1", "page1", "page2", "page2", "page2", "page2", 
"page2", "page2", "page2", "page2", "page2", "page2", "page3", 
"page3", "page3", "page3", "page3", "page3", "page3", "page3", 
"page1", "page1", "page1", "page1", "page1", "page1", "page1", 
"page1", "page1", "page1", "page2", "page2", "page2", "page2", 
"page2", "page2", "page3", "page3", "page3", "page3", "page3", 
"page3", "page3", "page3", "page3", "page3", "page3", "page3", 
"page4", "page4", "page4", "page4"), x = c(1, 2, 3, 4, 1, 2, 
3, 4, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 
5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 
3, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 1, 2), 
    y = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 
    6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 
    10, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 14, 
    14, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 
    17, 18, 18), text = c("ha", "r", "r", "y", "p", "ot", "t", 
    "er", "her", "m", "ione", "gra", "ng", "er", "vol", "de", 
    "m", "ort", "a", "l", "b", "u", "s", "du", "m", "ble", "do", 
    "re", "dr", "a", "g", "o", "ma", "lf", "o", "y", "s", "ev", 
    "e", "ru", "s", "r", "o", "g", "u", "e", "r", "o", "n", "we", 
    "as", "ley", "be", "l", "la", "t", "ri", "x", "l", "est", 
    "r", "a", "ng", "e", "sir", "ius", "bl", "ack"), nb = c(1, 
    2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 
    3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 
    4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 6, 
    1, 2, 3, 4, 5, 6, 1, 2, 1, 2), fulltext = c("harry", "harry", 
    "harry", "harry", "potter", "potter", "potter", "potter", 
    "hermione", "hermione", "hermione", "granger", "granger", 
    "granger", "volde", "volde", "mort", "mort", "albus", "albus", 
    "albus", "albus", "albus", "dumbledore", "dumbledore", "dumbledore", 
    "dumbledore", "dumbledore", "drago", "drago", "drago", "drago", 
    "malfoy", "malfoy", "malfoy", "malfoy", "severus", "severus", 
    "severus", "severus", "severus", "rogue", "rogue", "rogue", 
    "rogue", "rogue", "ron", "ron", "ron", "weasley", "weasley", 
    "weasley", "bellatrix", "bellatrix", "bellatrix", "bellatrix", 
    "bellatrix", "bellatrix", "lestrange", "lestrange", "lestrange", 
    "lestrange", "lestrange", "lestrange", "sirius", "sirius", 
    "black", "black"), nbsum = c(10, 10, 10, 10, 10, 10, 10, 
    10, 6, 6, 6, 6, 6, 6, 3, 3, 3, 3, 15, 15, 15, 15, 15, 15, 
    15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10, 15, 15, 15, 
    15, 15, 15, 15, 15, 15, 15, 6, 6, 6, 6, 6, 6, 21, 21, 21, 
    21, 21, 21, 21, 21, 21, 21, 21, 21, 3, 3, 3, 3)), row.names = c(NA, 
-68L), class = "data.frame")

Origen answered 19/12, 2023 at 18:1 Comment(0)

The most flexible function for this task is IMO collapse::unlist2d:

library(dplyr)
dat2 <- 
  collapse::unlist2d(dat0, idcols = c("pdf", "page", "tibble")) |> 
  mutate(fulltext = paste(text, collapse = ""), 
         nbsum = sum(nb),
         .by = c(pdf, page, tibble)) |> 
  select(-tibble)

identical(dat1, dat2)
#[1] TRUE

Scenarist answered 19/12, 2023 at 18:8 Comment(2)

Very effective, thank you Maël! But how to create the 2 new variables 'fulltext' and 'nbsum'? This is my main difficulty... – Origen 19/12, 2023 at 18:14

You can use dplyr there, with some grouping, no difficulties! – Pheasant 19/12, 2023 at 18:18

Here's a pure tidyverse solution:

library(dplyr)
library(purrr)

map_dfr(dat0, ~ map_dfr(.x, 
                        ~bind_rows(.x), 
                        .id = 'page'),
        .id = 'pdf') %>% 
  mutate(fulltext = paste(text, collapse=""),
         nbsum = sum(nb),
         .by = y)

#> # A tibble: 68 × 8
#>    pdf   page      x     y text     nb fulltext nbsum
#>    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#>  1 pdf1  page1     1     1 ha        1 harry       10
#>  2 pdf1  page1     2     1 r         2 harry       10
#>  3 pdf1  page1     3     1 r         3 harry       10
#>  4 pdf1  page1     4     1 y         4 harry       10
#>  5 pdf1  page1     1     2 p         1 potter      10
#>  6 pdf1  page1     2     2 ot        2 potter      10
#>  7 pdf1  page1     3     2 t         3 potter      10
#>  8 pdf1  page1     4     2 er        4 potter      10
#>  9 pdf1  page2     1     3 her       1 hermione     6
#> 10 pdf1  page2     2     3 m         2 hermione     6
#> # ℹ 58 more rows

We can also use rrapply package:

library(rrapply)
library(dplyr)
library(purrr)
library(tidyr)

rrapply(dat0, how = 'bind', options = list(namecols = TRUE)) %>% 
  mutate(fulltext = map_chr(text, .f = paste, collapse=""), 
         nbsum = map_dbl(nb, .f = sum), 
         L3 = NULL) %>% 
  unnest(cols = everything()) %>% 
  rename(setNames(c("L1", "L2"), c("pdf", "page")))

#> # A tibble: 68 × 8
#>    pdf   page      x     y text     nb fulltext nbsum
#>    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#>  1 pdf1  page1     1     1 ha        1 harry       10
#>  2 pdf1  page1     2     1 r         2 harry       10
#>  3 pdf1  page1     3     1 r         3 harry       10
#>  4 pdf1  page1     4     1 y         4 harry       10
#>  5 pdf1  page1     1     2 p         1 potter      10
#>  6 pdf1  page1     2     2 ot        2 potter      10
#>  7 pdf1  page1     3     2 t         3 potter      10
#>  8 pdf1  page1     4     2 er        4 potter      10
#>  9 pdf1  page2     1     3 her       1 hermione     6
#> 10 pdf1  page2     2     3 m         2 hermione     6
#> # ℹ 58 more rows

^{Created on 2023-12-20 with reprex v2.0.2}

Terrel answered 19/12, 2023 at 19:16 Comment(5)

Great! examples using map_dfr are not that common, so thank you very much M-- – Origen 19/12, 2023 at 22:24

note that with rrapply(..., how = "bind"), you can set options = list(namecols = TRUE) to include the pdf and page columns in the result as well – Prewitt 20/12, 2023 at 8:14

@JorisC. I actually was expecting an option like that, but missed it when skimming the docs. Thanks for the comment, I will update my answer in the morning. – Terrel 20/12, 2023 at 8:47

Isn't map_dfr "superseded" and on a deprecation path? Perhaps that's too bad, as that map_dfr solution is pretty nice. – Filip 20/12, 2023 at 15:6

@GregorThomas it's superseded, so it's not on deprecation path; "A superseded function will not emit a warning (since there’s no risk if you keep using it), but the documentation will tell you what we recommend instead ." – Terrel 20/12, 2023 at 15:56

I believe this works:

library(dplyr)
library(tidyr)
dat0 |>
  bind_rows(.id = "pdf") |>
  pivot_longer(starts_with("page"), names_to = "page") |>
  unnest(value) |>
  arrange(pdf, page, y, x) |>
  mutate(fulltext = paste(text, collapse = ""), nbsum = sum(nb), .by = c(pdf, page, y))
# # A tibble: 68 × 8
#    pdf   page      x     y text     nb fulltext nbsum
#    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#  1 pdf1  page1     1     1 ha        1 harry       10
#  2 pdf1  page1     2     1 r         2 harry       10
#  3 pdf1  page1     3     1 r         3 harry       10
#  4 pdf1  page1     4     1 y         4 harry       10
#  5 pdf1  page1     1     2 p         1 potter      10
#  6 pdf1  page1     2     2 ot        2 potter      10
#  7 pdf1  page1     3     2 t         3 potter      10
#  8 pdf1  page1     4     2 er        4 potter      10
#  9 pdf1  page2     1     3 her       1 hermione     6
# 10 pdf1  page2     2     3 m         2 hermione     6
# # ℹ 58 more rows
# # ℹ Use `print(n = ...)` to see more rows

Filip answered 19/12, 2023 at 18:29 Comment(6)

Thank you to all of you! Maël's approach is shorter, simpler and intuitive for a novice, Allan's is a bit longer but instructive for understanding the use of lapply in multi-nested levels, and Gregor's is original and just as effective. The pending question is: would this script be optimizable with a purrr-based alternative? – Origen 19/12, 2023 at 18:53

What do you mean be "optimizable"? Are you concerned about run time, memory usage? The collapse package Maël uses is about as high-performance as packages get. purrr isn't so much about efficiency as readability. – Filip 19/12, 2023 at 19:17

I misspoke; I wanted to say: what is the most effective script using purrr (even if more complex than Maël's) to meet the objective. I actually spent hours trying with purrr, unfortunately without success. If someone provides such a script, it would surely help many beginners to understand this tool, starting with me. – Origen 19/12, 2023 at 19:58

Well, you could use Allan's lapply approach and just change each lapply to purrr::map. There might be a simplification using list_rbind, I'm not really sure how it's different from bind_rows. – Filip 19/12, 2023 at 20:2

I'd also mention that you have 4 good answers now, and none of them use purrr at all. So maybe purrr isn't a great tool for this job. – Filip 19/12, 2023 at 20:22

I added a solution using purrr::map_dfr. That is actually cleaner than using 4 lapply or map. If we had a requirement to stick to tidyverse functions only, that'd be our best option. Cheers. cc: @Origen p.s. "purrr isn't so much about efficiency as readability" <-- I second that. – Terrel 19/12, 2023 at 22:4

If you want to create the new variables as you go, you could probably just do this using lapply, since your data is not too deeply nested:

lapply(dat0, unlist, recursive = FALSE) |>
  lapply(function(x) {
    lapply(x, \(x) x %>% 
             mutate(fulltext = paste(text, collapse = ''),
                    nbsum = sum(nb)))}) %>% 
  lapply(dplyr::bind_rows, .id = 'page') %>%
  dplyr::bind_rows(.id = 'pdf') %>%
  mutate(page = gsub('\\..*', '', page))
#> # A tibble: 68 x 8
#>    pdf   page      x     y text     nb fulltext nbsum
#>    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#>  1 pdf1  page1     1     1 ha        1 harry       10
#>  2 pdf1  page1     2     1 r         2 harry       10
#>  3 pdf1  page1     3     1 r         3 harry       10
#>  4 pdf1  page1     4     1 y         4 harry       10
#>  5 pdf1  page1     1     2 p         1 potter      10
#>  6 pdf1  page1     2     2 ot        2 potter      10
#>  7 pdf1  page1     3     2 t         3 potter      10
#>  8 pdf1  page1     4     2 er        4 potter      10
#>  9 pdf1  page2     1     3 her       1 hermione     6
#> 10 pdf1  page2     2     3 m         2 hermione     6
#> # i 58 more rows
#> # i Use `print(n = ...)` to see more rows

Ce answered 19/12, 2023 at 18:17 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags