You can create an appropriate matrix for this via casting from tidytext. There are several cast_ functions, such as cast_sparse().
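To make the casting step concrete before moving to real books, here is a minimal sketch on a hypothetical toy table (the `toy` data and its values are made up for illustration). It assumes one row per document-word pair with a count column, which is the shape cast_sparse() expects.

```r
library(tidyverse)
library(tidytext)

# Hypothetical toy counts: one row per (document, word) pair
toy <- tibble(
  document = c("doc_1", "doc_1", "doc_2"),
  word     = c("whale", "sea", "sea"),
  n        = c(2L, 1L, 3L)
)

# Rows become documents, columns become words, cells hold the counts
toy_sparse <- cast_sparse(toy, document, word, n)

dim(toy_sparse)
toy_sparse["doc_2", "sea"]
```

Pairs that never occur in the input (here, "whale" in doc_2) are simply stored as zeros in the sparse matrix.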
Let's use four example books, and cluster the chapters within the books:
library(tidyverse)
library(tidytext)
library(gutenbergr)
my_mirror <- "http://mirrors.xmission.com/gutenberg/"
books <- gutenberg_download(c(36, 158, 164, 345),
                            meta_fields = "title",
                            mirror = my_mirror)
books %>%
  count(title)
#> # A tibble: 4 x 2
#>   title                                     n
#> * <chr>                                 <int>
#> 1 Dracula                               15568
#> 2 Emma                                  16235
#> 3 The War of the Worlds                  6474
#> 4 Twenty Thousand Leagues under the Sea 12135
# break apart the chapters
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
glimpse(by_chapter)
#> Rows: 50,315
#> Columns: 3
#> $ gutenberg_id <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, …
#> $ text <chr> "CHAPTER ONE", "", "THE EVE OF THE WAR", "", "", "No one…
#> $ document <chr> "The War of the Worlds_1", "The War of the Worlds_1", "T…
words_sparse <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"
class(words_sparse)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(words_sparse)
#> [1] 182 18124
The words_sparse object is a sparse matrix created via cast_sparse(). You can learn more about converting back and forth from tidy and non-tidy formats for text in this chapter.
Now that you have your matrix of word counts (i.e., a document-term matrix, which you could consider weighting by tf-idf instead of counts), you can use kmeans(). How many chapters from each book were clustered together?
kfit <- kmeans(words_sparse, centers = 4)
enframe(kfit$cluster, value = "cluster") %>%
  separate(name, into = c("title", "chapter"), sep = "_") %>%
  count(title, cluster) %>%
  arrange(cluster)
#> # A tibble: 8 x 3
#>   title                                 cluster     n
#>   <chr>                                   <int> <int>
#> 1 Dracula                                     1    26
#> 2 The War of the Worlds                       1     1
#> 3 Dracula                                     2    28
#> 4 Emma                                        2     9
#> 5 The War of the Worlds                       2    26
#> 6 Twenty Thousand Leagues under the Sea       2     9
#> 7 Twenty Thousand Leagues under the Sea       3    37
#> 8 Emma                                        4    46
Created on 2021-02-04 by the reprex package (v1.0.0)
One cluster is all Emma, one cluster is all Twenty Thousand Leagues under the Sea, one cluster is Dracula plus a single chapter of The War of the Worlds, and one cluster has chapters from all four books.
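If you want to try the tf-idf weighting mentioned above, a minimal sketch might look like the following. It runs on a hypothetical toy table rather than the books (`toy_counts` and its values are made up), and it adds two things that are my own assumptions, not part of the original reprex: set.seed(), because kmeans() starts from random centers and so gives run-dependent cluster labels, and nstart = 10 to try several random starts.

```r
library(tidyverse)
library(tidytext)

# Hypothetical toy counts: one row per (document, word) pair
toy_counts <- tibble(
  document = c("a_1", "a_1", "a_2", "a_2", "b_1", "b_1"),
  word     = c("ship", "sea", "ship", "storm", "ball", "dance"),
  n        = c(3L, 2L, 4L, 1L, 5L, 2L)
)

# Weight each count by tf-idf, then cast the tf_idf column instead of n
toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, document, n) %>%
  cast_sparse(document, word, tf_idf)

# Assumption: fix the seed and use several random starts for stability
set.seed(2021)
# as.matrix() makes the dense conversion explicit; fine for toy data
toy_fit <- kmeans(as.matrix(toy_tfidf), centers = 2, nstart = 10)

toy_fit$cluster
```

On the real data the same pattern would replace cast_sparse(document, word, n) with a bind_tf_idf(word, document, n) step followed by cast_sparse(document, word, tf_idf); whether that separates the books better is something to check, not something this sketch shows.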