How to row bind dataframes stored in a multi-level nested list, and add an identifier column for each level?
Asked Answered
O

4

7

Beginner in data manipulation in R, I struggle with multi-level nested lists.

Question: Is there a way to transform this dat0 3-levels list into the global dat1 dataframe below?

  • The new fulltext column concatenates the text variable from each tibble.
  • the new nbsum column adds the nb variable from each tibble.

Note: a purrr-based approach with dplyr functions (mutate...) would be welcome to better understand these tools in this particular context. Other approaches are also welcome!

Thanks for help

Initial data:

enter image description here

dat0 <- list(pdf1 =
               list(page1 =
                      list(tibble1 = tibble(x = c(1,2,3,4), y = c(1,1,1,1), text = c("ha","r","r","y"), nb = c(1,2,3,4)),
                           tibble2 = tibble(x = c(1,2,3,4), y = c(2,2,2,2), text = c("p","ot","t","er"), nb = c(1,2,3,4))),
                    page2 = 
                      list(tibble1 = tibble(x = c(1,2,3), y = c(3,3,3), text = c("her","m","ione"), nb = c(1,2,3)),
                           tibble2 = tibble(x = c(1,2,3), y = c(4,4,4), text = c("gra","ng","er"), nb = c(1,2,3)))),
             pdf2 =
               list(page1 =
                      list(tibble1 = tibble(x = c(1,2), y = c(5,5), text = c("vol","de"), nb = c(1,2)),
                           tibble2 = tibble(x = c(1,2), y = c(6,6), text = c("m","ort"), nb = c(1,2))),
                    page2 =
                      list(tibble1 = tibble(x = c(1,2,3,4,5), y = c(7,7,7,7,7), text = c("a","l","b","u","s"), nb = c(1,2,3,4,5)),
                           tibble2 = tibble(x = c(1,2,3,4,5), y = c(8,8,8,8,8), text = c("du","m","ble","do","re"), nb = c(1,2,3,4,5))),
                    page3 = 
                      list(tibble1 = tibble(x = c(1,2,3,4), y = c(9,9,9,9), text = c("dr","a","g","o"), nb = c(1,2,3,4)),
                           tibble2 = tibble(x = c(1,2,3,4), y = c(10,10,10,10), text = c("ma","lf","o","y"), nb = c(1,2,3,4)))),
             pdf3 =
               list(page1 =
                      list(tibble1 = tibble(x = c(1,2,3,4,5), y = c(11,11,11,11,11), text = c("s","ev","e","ru","s"), nb = c(1,2,3,4,5)),
                           tibble2 = tibble(x = c(1,2,3,4,5), y = c(12,12,12,12,12), text = c("r","o","g","u","e"), nb = c(1,2,3,4,5))),
                    page2 =
                      list(tibble1 = tibble(x = c(1,2,3), y = c(13,13,13), text = c("r","o","n"), nb = c(1,2,3)),
                           tibble2 = tibble(x = c(1,2,3), y = c(14,14,14), text = c("we","as","ley"), nb = c(1,2,3))),
                    page3 =
                      list(tibble1 = tibble(x = c(1,2,3,4,5,6), y = c(15,15,15,15,15,15), text = c("be","l","la","t","ri","x"), nb = c(1,2,3,4,5,6)),
                           tibble2 = tibble(x = c(1,2,3,4,5,6), y = c(16,16,16,16,16,16), text = c("l","est","r","a","ng","e"), nb = c(1,2,3,4,5,6))),
                    page4 = 
                      list(tibble1 = tibble(x = c(1,2), y = c(17,17), text = c("sir","ius"), nb = c(1,2)),
                           tibble2 = tibble(x = c(1,2), y = c(18,18), text = c("bl","ack"), nb = c(1,2)))))

Desired output (constructed laboriously; dput script below):

enter image description here

dat1 <-
structure(list(pdf = c("pdf1", "pdf1", "pdf1", "pdf1", "pdf1", 
"pdf1", "pdf1", "pdf1", "pdf1", "pdf1", "pdf1", "pdf1", "pdf1", 
"pdf1", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", 
"pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", 
"pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf2", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", 
"pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3", "pdf3"), page = c("page1", 
"page1", "page1", "page1", "page1", "page1", "page1", "page1", 
"page2", "page2", "page2", "page2", "page2", "page2", "page1", 
"page1", "page1", "page1", "page2", "page2", "page2", "page2", 
"page2", "page2", "page2", "page2", "page2", "page2", "page3", 
"page3", "page3", "page3", "page3", "page3", "page3", "page3", 
"page1", "page1", "page1", "page1", "page1", "page1", "page1", 
"page1", "page1", "page1", "page2", "page2", "page2", "page2", 
"page2", "page2", "page3", "page3", "page3", "page3", "page3", 
"page3", "page3", "page3", "page3", "page3", "page3", "page3", 
"page4", "page4", "page4", "page4"), x = c(1, 2, 3, 4, 1, 2, 
3, 4, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 
5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 
3, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 1, 2), 
    y = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 
    6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 
    10, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 14, 
    14, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 
    17, 18, 18), text = c("ha", "r", "r", "y", "p", "ot", "t", 
    "er", "her", "m", "ione", "gra", "ng", "er", "vol", "de", 
    "m", "ort", "a", "l", "b", "u", "s", "du", "m", "ble", "do", 
    "re", "dr", "a", "g", "o", "ma", "lf", "o", "y", "s", "ev", 
    "e", "ru", "s", "r", "o", "g", "u", "e", "r", "o", "n", "we", 
    "as", "ley", "be", "l", "la", "t", "ri", "x", "l", "est", 
    "r", "a", "ng", "e", "sir", "ius", "bl", "ack"), nb = c(1, 
    2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 
    3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 
    4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 6, 
    1, 2, 3, 4, 5, 6, 1, 2, 1, 2), fulltext = c("harry", "harry", 
    "harry", "harry", "potter", "potter", "potter", "potter", 
    "hermione", "hermione", "hermione", "granger", "granger", 
    "granger", "volde", "volde", "mort", "mort", "albus", "albus", 
    "albus", "albus", "albus", "dumbledore", "dumbledore", "dumbledore", 
    "dumbledore", "dumbledore", "drago", "drago", "drago", "drago", 
    "malfoy", "malfoy", "malfoy", "malfoy", "severus", "severus", 
    "severus", "severus", "severus", "rogue", "rogue", "rogue", 
    "rogue", "rogue", "ron", "ron", "ron", "weasley", "weasley", 
    "weasley", "bellatrix", "bellatrix", "bellatrix", "bellatrix", 
    "bellatrix", "bellatrix", "lestrange", "lestrange", "lestrange", 
    "lestrange", "lestrange", "lestrange", "sirius", "sirius", 
    "black", "black"), nbsum = c(10, 10, 10, 10, 10, 10, 10, 
    10, 6, 6, 6, 6, 6, 6, 3, 3, 3, 3, 15, 15, 15, 15, 15, 15, 
    15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10, 15, 15, 15, 
    15, 15, 15, 15, 15, 15, 15, 6, 6, 6, 6, 6, 6, 21, 21, 21, 
    21, 21, 21, 21, 21, 21, 21, 21, 21, 3, 3, 3, 3)), row.names = c(NA, 
-68L), class = "data.frame")
Origen answered 19/12, 2023 at 18:1 Comment(0)
S
8

The most flexible function for this task is IMO collapse::unlist2d:

library(dplyr)
dat2 <- 
  collapse::unlist2d(dat0, idcols = c("pdf", "page", "tibble")) |> 
  mutate(fulltext = paste(text, collapse = ""), 
         nbsum = sum(nb),
         .by = c(pdf, page, tibble)) |> 
  select(-tibble)

identical(dat1, dat2)
#[1] TRUE
Scenarist answered 19/12, 2023 at 18:8 Comment(2)
Very effective, thank you Maël! But how to create the 2 new variables 'fulltext' and 'nbsum'? This is my main difficulty...Origen
You can use dplyr there, with some grouping, no difficulties!Pheasant
T
3

Here's a pure tidyverse solution:

library(dplyr)
library(purrr)

map_dfr(dat0, ~ map_dfr(.x, 
                        ~bind_rows(.x), 
                        .id = 'page'),
        .id = 'pdf') %>% 
  mutate(fulltext = paste(text, collapse=""),
         nbsum = sum(nb),
         .by = y)

#> # A tibble: 68 × 8
#>    pdf   page      x     y text     nb fulltext nbsum
#>    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#>  1 pdf1  page1     1     1 ha        1 harry       10
#>  2 pdf1  page1     2     1 r         2 harry       10
#>  3 pdf1  page1     3     1 r         3 harry       10
#>  4 pdf1  page1     4     1 y         4 harry       10
#>  5 pdf1  page1     1     2 p         1 potter      10
#>  6 pdf1  page1     2     2 ot        2 potter      10
#>  7 pdf1  page1     3     2 t         3 potter      10
#>  8 pdf1  page1     4     2 er        4 potter      10
#>  9 pdf1  page2     1     3 her       1 hermione     6
#> 10 pdf1  page2     2     3 m         2 hermione     6
#> # ℹ 58 more rows

We can also use rrapply package:

library(rrapply)
library(dplyr)
library(purrr)
library(tidyr)

rrapply(dat0, how = 'bind', options = list(namecols = TRUE)) %>% 
  mutate(fulltext = map_chr(text, .f = paste, collapse=""), 
         nbsum = map_dbl(nb, .f = sum), 
         L3 = NULL) %>% 
  unnest(cols = everything()) %>% 
  rename(setNames(c("L1", "L2"), c("pdf", "page")))

#> # A tibble: 68 × 8
#>    pdf   page      x     y text     nb fulltext nbsum
#>    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#>  1 pdf1  page1     1     1 ha        1 harry       10
#>  2 pdf1  page1     2     1 r         2 harry       10
#>  3 pdf1  page1     3     1 r         3 harry       10
#>  4 pdf1  page1     4     1 y         4 harry       10
#>  5 pdf1  page1     1     2 p         1 potter      10
#>  6 pdf1  page1     2     2 ot        2 potter      10
#>  7 pdf1  page1     3     2 t         3 potter      10
#>  8 pdf1  page1     4     2 er        4 potter      10
#>  9 pdf1  page2     1     3 her       1 hermione     6
#> 10 pdf1  page2     2     3 m         2 hermione     6
#> # ℹ 58 more rows

Created on 2023-12-20 with reprex v2.0.2

Terrel answered 19/12, 2023 at 19:16 Comment(5)
Great! examples using map_dfr are not that common, so thank you very much M--Origen
note that with rrapply(..., how = "bind"), you can set options = list(namecols = TRUE) to include the pdf and page columns in the result as wellPrewitt
@JorisC. I actually was expecting an option like that, but missed it when skimming the docs. Thanks for the comment, I will update my answer in the morning.Terrel
Isn't map_dfr "superseded" and on a deprecation path? Perhaps that's too bad, as that map_dfr solution is pretty nice.Filip
@GregorThomas it's superseded, so it's not on deprecation path; "A superseded function will not emit a warning (since there’s no risk if you keep using it), but the documentation will tell you what we recommend instead ."Terrel
F
2

I believe this works:

library(dplyr)
library(tidyr)
dat0 |>
  bind_rows(.id = "pdf") |>
  pivot_longer(starts_with("page"), names_to = "page") |>
  unnest(value) |>
  arrange(pdf, page, y, x) |>
  mutate(fulltext = paste(text, collapse = ""), nbsum = sum(nb), .by = c(pdf, page, y))
# # A tibble: 68 × 8
#    pdf   page      x     y text     nb fulltext nbsum
#    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#  1 pdf1  page1     1     1 ha        1 harry       10
#  2 pdf1  page1     2     1 r         2 harry       10
#  3 pdf1  page1     3     1 r         3 harry       10
#  4 pdf1  page1     4     1 y         4 harry       10
#  5 pdf1  page1     1     2 p         1 potter      10
#  6 pdf1  page1     2     2 ot        2 potter      10
#  7 pdf1  page1     3     2 t         3 potter      10
#  8 pdf1  page1     4     2 er        4 potter      10
#  9 pdf1  page2     1     3 her       1 hermione     6
# 10 pdf1  page2     2     3 m         2 hermione     6
# # ℹ 58 more rows
# # ℹ Use `print(n = ...)` to see more rows
Filip answered 19/12, 2023 at 18:29 Comment(6)
Thank you to all of you! Maël's approach is shorter, simpler and intuitive for a novice, Allan's is a bit longer but instructive for understanding the use of lapply in multi-nested levels, and Gregor's is original and just as effective. The pending question is: would this script be optimizable with a purrr-based alternative?Origen
What do you mean be "optimizable"? Are you concerned about run time, memory usage? The collapse package Maël uses is about as high-performance as packages get. purrr isn't so much about efficiency as readability.Filip
I misspoke; I wanted to say: what is the most effective script using purrr (even if more complex than Maël's) to meet the objective. I actually spent hours trying with purrr, unfortunately without success. If someone provides such a script, it would surely help many beginners to understand this tool, starting with me.Origen
Well, you could use Allan's lapply approach and just change each lapply to purrr::map. There might be a simplification using list_rbind, I'm not really sure how it's different from bind_rows.Filip
I'd also mention that you have 4 good answers now, and none of them use purrr at all. So maybe purrr isn't a great tool for this job.Filip
I added a solution using purrr::map_dfr. That is actually cleaner than using 4 lapply or map. If we had a requirement to stick to tidyverse functions only, that'd be our best option. Cheers. cc: @Origen p.s. "purrr isn't so much about efficiency as readability" <-- I second that.Terrel
C
1

If you want to create the new variables as you go, you could probably just do this using lapply, since your data is not too deeply nested:

lapply(dat0, unlist, recursive = FALSE) |>
  lapply(function(x) {
    lapply(x, \(x) x %>% 
             mutate(fulltext = paste(text, collapse = ''),
                    nbsum = sum(nb)))}) %>% 
  lapply(dplyr::bind_rows, .id = 'page') %>%
  dplyr::bind_rows(.id = 'pdf') %>%
  mutate(page = gsub('\\..*', '', page))
#> # A tibble: 68 x 8
#>    pdf   page      x     y text     nb fulltext nbsum
#>    <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>    <dbl>
#>  1 pdf1  page1     1     1 ha        1 harry       10
#>  2 pdf1  page1     2     1 r         2 harry       10
#>  3 pdf1  page1     3     1 r         3 harry       10
#>  4 pdf1  page1     4     1 y         4 harry       10
#>  5 pdf1  page1     1     2 p         1 potter      10
#>  6 pdf1  page1     2     2 ot        2 potter      10
#>  7 pdf1  page1     3     2 t         3 potter      10
#>  8 pdf1  page1     4     2 er        4 potter      10
#>  9 pdf1  page2     1     3 her       1 hermione     6
#> 10 pdf1  page2     2     3 m         2 hermione     6
#> # i 58 more rows
#> # i Use `print(n = ...)` to see more rows
Ce answered 19/12, 2023 at 18:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.