Opposite of unnest_tokens

This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word my question to search.

I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.

What's the opposite / inverse command of unnest_tokens?

Edit: here is what the data I'm working with look like. I'm trying to replicate analyses from Silge and Robinson's Tidy Text book but using Italian opera librettos.

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
sample_df = data.frame(character, line)
sample_df

character line
FIGARO    Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA   Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE     Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!

I turn it into tidy text so I can get rid of stop words:

library(dplyr)
library(tidytext)

# One row per word; named tidy_lines so the object does not mask tibble::tribble()
tidy_lines <- sample_df %>%
  unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera
# (mystopwords is my own vector of stop words, not shown here)
itstopwords <- tibble(word = mystopwords)
tidy_lines2 <- tidy_lines %>%
  anti_join(itstopwords, by = "word")

Now I have something like this:

character word
FIGARO    cinque
FIGARO    dieci
FIGARO    venti
FIGARO    trenta
...

I would like to get it back into the format of character name and the associated line to look at other things. Basically I would like the text in the same format it was before, but with stop words removed.

Pansir answered 13/10, 2017 at 16:44 Comment(1)
Hi, please read this and edit your question. Knowing more about what your data are like and what you did will make it possible for other users to help you. - Tripoli

Not a stupid question! The answer depends a bit on exactly what you are trying to do, but here would be my typical approach if I wanted to get my text back to its original form after some processing in its tidied form, using the group_by() function from dplyr.

First, let's go from raw text to a tidied format.

library(tidyverse)
library(tidytext)

tidy_austen <- janeaustenr::austen_books() %>%
    group_by(book) %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text)

tidy_austen
#> # A tibble: 725,055 x 3
#>    book                linenumber word       
#>    <fct>                    <int> <chr>      
#>  1 Sense & Sensibility          1 sense      
#>  2 Sense & Sensibility          1 and        
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by         
#>  5 Sense & Sensibility          3 jane       
#>  6 Sense & Sensibility          3 austen     
#>  7 Sense & Sensibility          5 1811       
#>  8 Sense & Sensibility         10 chapter    
#>  9 Sense & Sensibility         10 1          
#> 10 Sense & Sensibility         13 the        
#> # … with 725,045 more rows

The text is tidy now! But we can untidy it, back to something sort of like its original form. I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. What does the text look like at the end, in this particular case?

tidy_austen %>% 
    group_by(book, linenumber) %>% 
    summarize(text = str_c(word, collapse = " ")) %>%
    ungroup()
#> # A tibble: 62,272 x 3
#>    book            linenumber text                                         
#>    <fct>                <int> <chr>                                        
#>  1 Sense & Sensib…          1 sense and sensibility                        
#>  2 Sense & Sensib…          3 by jane austen                               
#>  3 Sense & Sensib…          5 1811                                         
#>  4 Sense & Sensib…         10 chapter 1                                    
#>  5 Sense & Sensib…         13 the family of dashwood had long been settled…
#>  6 Sense & Sensib…         14 was large and their residence was at norland…
#>  7 Sense & Sensib…         15 their property where for many generations th…
#>  8 Sense & Sensib…         16 respectable a manner as to engage the genera…
#>  9 Sense & Sensib…         17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib…         18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows

Created on 2019-07-11 by the reprex package (v0.3.0)
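
For the opera data in the question, the same pattern might look like the sketch below. It is only an illustration under a few assumptions: the two-word stop word list is a hypothetical stand-in for the asker's custom list, and the line_id column is an added helper so that each line can be regrouped in its original order even when one character speaks several lines. Note that unnest_tokens() lowercases the text and strips punctuation by default, so the result is not identical to the original lines.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two lines taken from the question's sample data
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stand-in for the asker's custom stop word list
itstopwords <- tibble(word = c("mi", "e"))

restored <- sample_df %>%
  mutate(line_id = row_number()) %>%        # assumed helper key, one id per line
  unnest_tokens(word, line) %>%             # one row per word
  anti_join(itstopwords, by = "word") %>%   # drop stop words
  group_by(line_id, character) %>%          # regroup the words of each line
  summarize(line = str_c(word, collapse = " "), .groups = "drop")

restored
```

A line made up entirely of stop words would have no rows left after the anti_join() and so would disappear from the result; joining restored back onto the original data frame by line_id is one way to keep such lines.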

Brannen answered 13/10, 2017 at 20:37 Comment(0)
library(tidyverse)
# tidy_austen is the one-word-per-row tibble built in the answer above
tidy_austen %>%
    group_by(book, linenumber) %>%
    summarise(text = str_c(word, collapse = " "))
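
For comparison, the same collapse can be sketched in base R with aggregate(). This is not part of the answer above, only an equivalent under the assumption of a small made-up data frame (tidy_words) with the same key columns; it groups on the keys and pastes each group's words together with a space.

```r
# Toy tidy data (made up for illustration): one word per row, keyed by book and line
tidy_words <- data.frame(
  book = c("A", "A", "A", "B", "B"),
  linenumber = c(1, 1, 2, 1, 1),
  word = c("sense", "and", "sensibility", "pride", "prejudice"),
  stringsAsFactors = FALSE
)

# Collapse the words of each (book, linenumber) group into one string
untidied <- aggregate(word ~ book + linenumber, data = tidy_words,
                      FUN = paste, collapse = " ")
names(untidied)[names(untidied) == "word"] <- "text"

# aggregate() does not sort by book first, so restore reading order
untidied <- untidied[order(untidied$book, untidied$linenumber), ]
untidied
```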
Breaking answered 20/6, 2018 at 23:23 Comment(4)
Can you explain how this answers the question? - Reconstructionism
unnest_tokens() simply splits text into words and arranges them one per row. The operation above is the exact opposite: it groups the rows by a common key and collapses each group's words into a single space-separated string. - Breaking
This really is the more obvious and cleaner solution. It also happens to be faster. The answer itself lacks completeness and detail, but IMO group_by and summarize is much more readable than the nest and mutate strategy. - Dolomites
I had just come back to this question to update my answer, because I found that str_c() worked for this within summarize(). Nice! - Brannen
