Tabulizer package in R: how to scrape tables after specific Title

How can I scrape tables preceded by specific title text from a PDF? I am experimenting with the tabulizer package. Here is an example of getting a table from a specific page of one such PDF (from the Polish "Map of Public Health Needs"):

# the Java heap size must be set before tabulizer (rJava) is loaded
options(java.parameters = "-Xmx8000m")
library(tabulizer)
library(tidyverse)

location <- "http://www.mpz.mz.gov.pl/wp-content/uploads/sites/4/2019/01/mpz_choroby_ukladu_kostno_miesniowego_woj_dolnoslaskie.pdf"

# page 8 holds several tables; [[4]] picks the fourth one tabulizer finds
(out <- extract_tables(location, pages = 8, encoding = "UTF-8",
                       method = "stream", outdir = getwd())[[4]] %>%
    as_tibble())

This gets me one table from a specific page. But I will have plenty of such PDFs to scrape from the site http://www.mpz.mz.gov.pl/mapy-dla-30-grup-chorob-2018/ and its subpages (many links for each illness, which I collect with rvest), one for each province of Poland, and I need to scrape the tables that follow a specific title string, e.g.

"Tabela 1.2.2: Struktura zapadalnosci rejestrowanej w zależności od płci, miejsca zamieszkania oraz grupy wiekowej - Choroby układowe tkanki łącznej" (Table 1.2.2: Structure of registered incidence by sex, place of residence, and age group - Systemic connective tissue diseases)

I need to detect "Tabela (...) Struktura zapadalnosci (...)" because the tables may not be on the same page in every PDF. Many thanks in advance for any directions and ideas.

EDIT: After asking the question, I managed to find the pages where the table might be, though perhaps inefficiently:

library(pdfsearch)

# search the PDF for both keywords and collect the matching page numbers
pages <-
  keyword_search(
    location,
    keyword = c(
      'Tabela',
      'Struktura zapadalnosci rejestrowanej'
    ),
    path = TRUE,
    surround_lines = FALSE
  ) %>%
  group_by(page_num) %>%
  mutate(keyword = paste0(keyword, collapse = ";")) %>%
  filter(
    str_detect(keyword, "Tabela") &
      str_detect(keyword, "Struktura zapadalnosci rejestrowanej")
  ) %>%
  pull(page_num) %>%
  unique()
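
The found pages could then be fed back into extract_tables, roughly like this (a rough sketch; each page may still hold several tables, so the right one would have to be picked afterwards):

# extract every table tabulizer finds on each candidate page
candidate_tables <- lapply(pages, function(p) {
  extract_tables(location,
                 pages = p,
                 encoding = "UTF-8",
                 method = "stream")
})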
Helman answered 28/1, 2019 at 14:8

I can help you with your basic problem, but there is one catch (see the end). I use pdftools instead of pdfsearch, but in this case it does basically the same thing (finding the pages that contain a table). To save time, I download the PDF only once at the start:

options(java.parameters = "-Xmx8000m")# needs to be set before loading tabulizer
library(tabulizer)
library(tidyverse)

location <- "http://www.mpz.mz.gov.pl/wp-content/uploads/sites/4/2019/01/mpz_choroby_ukladu_kostno_miesniowego_woj_dolnoslaskie.pdf"
download.file(location, "test.pdf", mode = "wb")

Now convert the PDF to a data.frame with one line of text per row:

raw <- pdftools::pdf_data("test.pdf")  # one data.frame of words per page

pages <- lapply(seq_along(raw), function(p) {
  if (nrow(raw[[p]]) > 0) {    # skip empty pages
    raw[[p]]$page <- p         # remember the page number
    raw[[p]]
  }
}) %>% 
  bind_rows() %>% 
  group_by(y, page) %>%                              # words sharing a y position form one line
  summarise(text = paste(text, collapse = " ")) %>%  # glue each line's words back together
  arrange(page, y)

This data.frame is searchable, and we keep only the lines matching your key phrase:

tables <- pages %>% 
  filter(grepl("Tabela .* Struktura zapadalnosci", text))

There are 8 lines that match the key phrase, and we extract tables only from those pages. Furthermore, the function inside the lapply loop keeps only the largest matrix (rows plus columns) found on each page. If there were two tables on one page, that might be a problem, but in general it works well to keep only the 'best guess' that tabulizer made when detecting the table structure.

tables_list <- lapply(tables$page, function(p) {
  cat(p, "\n")  # progress indicator
  out <- extract_tables("test.pdf", 
                        pages = p,
                        encoding = "UTF-8", 
                        method = "stream", 
                        output = "matrix")
  # keep the biggest table (rows + columns) found on the page
  out <- as_tibble(out[[which.max(sapply(out, nrow) + sapply(out, ncol))]])
  # attach the matching caption line as an attribute
  attr(out, "caption") <- tables$text[tables$page %in% p]
  return(out)
})

The object tables_list now contains a list of data.frames, each a converted table:

> tables_list[[1]]
# A tibble: 16 x 8
   V1                  V2    V3    V4    V5    V6    V7    V8   
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 dolnośląskie        77,05 74,65 4,04  10,59 13,37 27,87 44,14
 2 kujawsko-pomorskie  78,12 65,93 4,29  14,96 14,82 27,01 38,92
 3 lubelskie           76,50 56,83 2,67  14,83 17,00 29,00 36,50
 4 lubuskie            79,10 76,23 4,92  12,70 12,70 30,74 38,93
 5 łódzkie             74,37 67,77 6,45  13,84 15,09 30,03 34,59
 6 małopolskie         72,71 55,35 6,99  14,63 12,01 25,87 40,50
 7 mazowieckie         76,31 68,52 5,89  12,11 12,30 27,03 42,67
 8 opolskie            79,55 54,65 4,83  10,04 17,47 26,02 41,64
 9 podkarpackie        75,10 47,32 7,57  14,86 18,29 25,31 33,98
10 podlaskie           74,18 68,00 5,82  10,55 17,09 32,36 34,18
11 pomorskie           76,57 74,96 5,71  12,74 13,76 26,65 41,14
12 śląskie             73,51 81,15 4,89  14,96 14,43 26,64 39,08
13 świętokrzyskie      74,45 56,51 4,91  14,00 14,74 27,27 39,07
14 warmińsko-mazurskie 75,91 63,22 5,62  13,59 18,48 29,53 32,79
15 wielkopolskie       72,66 62,71 3,62  14,37 14,77 29,45 37,79
16 zachodniopomorskie  74,26 73,21 8,44  13,71 11,60 24,89 41,35

I also added the (first line of the) caption of each table as an attribute to the data.frame:

> attr(tables_list[[1]], "caption")
[1] "Tabela 1.2.2: Struktura zapadalnosci rejestrowanej w zależności od płci, miejsca zamieszkania oraz grupy"

Compare the extracted caption and table to the PDF:

(screenshot of Tabela 1.2.2 in the original PDF)

It seems this worked well, except that the column names are gone. I am not sure whether there is a way to retain them, but that was not part of your question, so maybe you already have a solution?
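
One possible workaround, since these tables all seem to share the same layout, would be to assign a header by hand. The names below are hypothetical placeholders to be transcribed from the PDF:

# hypothetical column names -- transcribe the real header from the PDF
header <- c("wojewodztwo", "col_2", "col_3", "col_4",
            "col_5", "col_6", "col_7", "col_8")
tables_list <- lapply(tables_list, setNames, header)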

Loads answered 3/4, 2019 at 16:52

You should try Rcrawler.

Its main Rcrawler() function seems designed specifically for your need, via its KeywordsFilter argument:

KeywordsFilter

a character vector, for users who want to crawl or collect only web pages that contain one or more given keywords. Rcrawler calculates an accuracy score based on the number of keywords found. This parameter must be a vector with at least one keyword, e.g. c("mykeyword").
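
A minimal sketch of how this could look, using the overview page from your question as the start URL (other Rcrawler arguments left at their defaults):

library(Rcrawler)

# crawl the site, keeping only pages that mention the caption keywords
Rcrawler(Website = "http://www.mpz.mz.gov.pl/mapy-dla-30-grup-chorob-2018/",
         KeywordsFilter = c("Tabela", "Struktura zapadalnosci"))

Note that Rcrawler collects the matching HTML pages; extracting tables from the linked PDFs would still be a separate step (e.g. with tabulizer, as in the other answer).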

Lucilelucilia answered 3/4, 2019 at 15:26
