Recognize PDF table using R

I'm trying to extract data from tables inside some pdf reports.

I've seen some examples using pdftools and similar packages, and I was able to get the text. However, I only want to extract the tables.

Is there a way to use R to recognize and extract only tables?

Adytum answered 23/5, 2017 at 17:15 Comment(2)
Package pdftables: cran.r-project.org/web/packages/pdftables/pdftables.pdf - Calfee
tabulizer (ropensci github) - Metrist

Awesome question, I wondered about the same thing recently, thanks!

I did it with tabulapdf ‘1.0.5’ (formerly tabulizer), as @hrbrmstr also suggests.

Installation

  1. Make sure a Java JRE or JDK is installed on your system.
  2. Install the dependency rJava first:

> install.packages("rJava")

  3. Install tabulapdf:

> remotes::install_github("ropensci/tabulapdf")
> ## on 64-bit Windows:
> # remotes::install_github("ropensci/tabulapdf", INSTALL_opts = "--no-multiarch")

Now you are ready to extract tables from your PDF reports.

Example

> library(tabulapdf)
> ## load report
> l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" 
> m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
> ## use first row as column names
> dat <- setNames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])
> ## example-specific date conversion
> dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
> dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))
> dat  ## voilà
   Speed (mph)          Driver                        Car    Engine       Date
1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15

Hope it works for you.

Limitations: The table in this example is quite simple. With messier tables you may have to clean up the result with gsub and similar functions.
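As a minimal sketch of that kind of cleanup, assume an extracted column comes back as text with thousands separators and a unit suffix. The vector `raw` below is made up for illustration and is not taken from the example PDF.

```r
## hypothetical raw cells, as they might come out of an extracted table
raw <- c(" 1,234.5 km", "987 km ", "12,000 km")

## drop everything except digits, the decimal point and a minus sign
clean <- gsub("[^0-9.-]", "", raw)

## convert the cleaned strings to numbers
num <- as.numeric(clean)
num
```

The pattern is deliberately blunt. It would also strip letters that carry meaning, so it only suits columns that are supposed to be purely numeric.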

Flagitious answered 24/5, 2017 at 0:52 Comment(4)
tabulizer can be ridiculously difficult to install. I never got it working on my Mac. - Aerostatics
@jaySf - The issue I am facing is that tabulizer() is reading all the tables, but only the header of each table and not its contents. Any suggestion how to solve this? - Bly
@ChetanArvindPatil Hard to tell without an example. I assume that whether tabulizer works or not depends on the software that created the PDF. - Flagitious
I found this helpful, but it still didn't work completely ... #43885103 gave alternative steps that worked for me. (Win10) - Skip

I would love to know the answer to this as well. In my experience, you need regular expressions to get the data into the format you want. The following is an example:

library(pdftools)
dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
dat <- paste0(dat, collapse = " ")
pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
extract <- gsub('\n', "  ", extract)
strsplit(extract, "\\s{2,}")

From here, you can loop over the resulting strings to build the table as desired. But as you can see in the link, the PDF contains more than just a table.
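One way to build the table without an explicit loop is to fill a matrix row by row. This is only a sketch: it assumes the split produces exactly five cells per table row, which may not hold for every line of the PDF. The vector `cells` stands in for `strsplit(extract, "\\s{2,}")[[1]]` and is shortened to the header and the first data row.

```r
## stand-in for the strsplit() result: header plus first data row
cells <- c("Berufsfeuerwehr", "Straße", "PLZ/Ort",
           "Telefon-Nr. Zentrale", "Fax-Nr. Zentrale",
           "Aachen", "Stolberger Str. 155", "52068 Aachen",
           "0241.43237-9000", "0241.512527")

## assumed number of columns in the table
n_col <- 5

## stop early if the cells do not divide evenly into rows
stopifnot(length(cells) %% n_col == 0)

## fill the matrix row by row
m <- matrix(cells, ncol = n_col, byrow = TRUE)
m
```

If the length check fails, some row was split into too few or too many cells, and the pattern or the split needs adjusting before the matrix is built.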

Blither answered 23/5, 2017 at 17:22 Comment(0)

I have been able to extract the tables of the PDF "https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf" with the following code:

library(pdftools)
library(stringr)
path_PDF <- "https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf"
list_text <- pdftools::pdf_text(path_PDF)
list_text <- lapply(X = list_text, FUN = function(x) strsplit(x, "\n")[[1]])
list_text <- lapply(X = list_text, FUN = function(x) x[x != ""])
list_text <- lapply(X = list_text, FUN = function(x) str_split(x, "[:space:]{3,100}"))

nb_Page <- length(list_text)
vector_Nb_Col_By_Page <- c(5, 4, 5, 4)

for(i in 1 : nb_Page)
{
  list_text[[i]] <- lapply(X = list_text[[i]], FUN = function(x) if(length(x) == vector_Nb_Col_By_Page[i]) x else NULL)
  list_text[[i]] <- do.call("rbind", list_text[[i]])
}

list_text

[[1]]
      [,1]                   [,2]                          [,3]                         [,4]                   [,5]              
 [1,] "Berufsfeuerwehr"      "Straße"                      "PLZ/Ort"                    "Telefon-Nr. Zentrale" "Fax-Nr. Zentrale"
 [2,] "Aachen"               "Stolberger Str. 155"         "52068 Aachen"               "0241.43237-9000"      "0241.512527"     
 [3,] "Altenburg"            "Remsaer Str. 4"              "04600 Altenburg"            "03447.594-354"        "03447.594-359"   
 [4,] "Augsburg"             "Berliner Allee 30"           "86153 Augsburg"             "0821.324-37610"       "0821.324-37634"  
 [5,] "Bautzen"              "Gesundbrunnenring 23"        "02625 Bautzen"              "03591.6798-33"        "03591.6798-66"   
 [6,] "Berlin"               "Voltairestr. 2"              "10179 Berlin"               "030.387-111"          "030.387-10939"   
 [7,] "Bielefeld"            "Am Stadtholz 18"             "33609 Bielefeld"            "0521.512301"          "0521.516590"     
 [8,] "Bochum"               "Brandwacht 1"                "44894 Bochum"               "0234.9254-0"          "0234.9254-909"   
 [9,] "Bonn"                 "Lievelingsweg 112"           "53119 Bonn"                 "0228.717-0"           "0228.664649"     
[10,] "Bottrop"              "Hans-Sachs-Str. 78-80"       "46236 Bottrop"              "02041.7803-0"         "02041.7803-509"  
[11,] "Brandenburg a. d. H." "Fontanestr. 1"               "14770 Brandenburg a. d. H." "03381.6230"           "03381.623151"    
[12,] "Braunschweig"         "Feuerwehrstr. 1"             "38114 Braunschweig"         "0531.2345-0"          "0531.2345-400"   
[13,] "Bremen"               "Am Wandrahm 24"              "28195 Bremen"               "0421.3030-0"          "0421.3030-11560" 
[14,] "Bremerhaven"          "Zur Hexenbrücke 12"          "27570 Bremerhaven"          "0471.590-0"           "0471.590-1269"   
[15,] "Chemnitz"             "Schadestr. 11"               "09112 Chemnitz"             "0371.300641"          "0371.488-3799"   
[16,] "Cottbus"              "Dresdener Str. 46"           "03050 Cottbus"              "0355.632-0"           "0355.632-138"    
[17,] "Cuxhaven"             "Schulstr. 3"                 "27474 Cuxhaven"             "04721.728-0"          "04721.728-244"   
[18,] "Darmstadt"            "Bismarckstr. 86"             "64293 Darmstadt"            "06151.780-0"          "06151.132403"    
[19,] "Delmenhorst"          "Rudolf-Königer-Str. 35"      "27753 Delmenhorst"          "04221.99-0"           "04221.14549"     
[20,] "Dessau-Roßlau"        "Innsbrucker Str. 8"          "06849 Dessau-Roßlau"        "0340.204-1376"        "0340.204-2927"   
[21,] "Dortmund"             "Steinstr. 25"                "44122 Dortmund"             "0231.845-0"           "0231.845-6666"   
[22,] "Dresden"              "Scharfenberger Str. 47"      "01139 Dresden"              "0351.8155-0"          "0351.8155-253"   
[23,] "Duisburg"             "Wintgensstr. 111"            "47058 Duisburg"             "0203.308-0"           "0203.308-4000"   
[24,] "Düsseldorf"           "Hüttenstr. 68"               "40215 Düsseldorf"           "0211.8920-999"        "0211.371574"     
[25,] "Eberswalde"           "Eberswalder Str. 41 a"       "16227 Eberswalde"           "03334.8191811"        "03334.8191822"   
[26,] "Eisenach"             "An der Feuerwache 6"         "99817 Eisenach"             "03691.7220"           "03691.722310"    
[27,] "Erfurt"               "St.-Florian-Str. 4"          "99092 Erfurt"               "0361.741-5100"        "0361.741-5109"   
[28,] "Essen"                "Eiserne Hand 45"             "45139 Essen"                "0201.123-9"           "0201.228233"     
[29,] "Flensburg"            "Munketoft 16"                "24937 Flensburg"            "0461.85-1111"         "0461.85-2925"    
[30,] "Frankfurt am Main"    "Feuerwehrstr. 1"             "60435 Frankfurt am Main"    "069.212-725127"       "069.212-725129"  
[31,] "Frankfurt/Oder"       "Heinrich-Hildebrand-Str. 21" "15232 Frankfurt/Oder"       "0335.5653715"         "0335.5653703"    
[32,] "Freiburg i. Br."      "Eschholzstr. 118"            "79115 Freiburg i. Br."      "0761.201-3315"        "0761.201-3395"   
[33,] "Fürth"                "Helmplatz 2"                 "90762 Fürth"                "0911.974-3600"        "0911.974-3611"   
[34,] "Gelsenkirchen"        "Seestr. 3"                   "45894 Gelsenkirchen"        "0209.1704-0"          "0209.1704-983"   
[35,] "Gera"                 "Berliner Str. 153"           "07546 Gera"                 "0365.48820"           "0365.22222"      
[36,] "Gießen"               "Steinstr. 1"                 "35390 Gießen"               "0641.306-3700"        "0641.306-3709"   
[37,] "Görlitz"              "Krölstr. 26"                 "02826 Görlitz"              "03581.486411"         "03581.486410"    
[38,] "Göttingen"            "Breslauer Str. 10"           "37085 Göttingen"            "0551.70750"           "0551.7075180"    
[39,] "Gotha"                "Oststr. 33"                  "99867 Gotha"                "03621.222560"         "03621.222565"    
[40,] "Greifswald"           "Wolgaster Str. 63 B"         "17489 Greifswald"           "03834.8536-2601"      "03834.8536-2622" 
[41,] "Gütersloh"            "Friedrich-Ebert-Str. 40"     "33330 Gütersloh"            "05241.504450"         "05241.822029"    
[42,] "Hagen"                "Bergischer Ring 87"          "58095 Hagen"                "02331.3740"           "02331.374-4374"  
[43,] "Halle"                "An der Feuerwache 5"         "06124 Halle (Saale)"        "0345.221-5000"        "0345.221-5250"   
[44,] "Hamburg"              "Westphalensweg 1"            "20099 Hamburg"              "040.42851-0"          "040.42851-4119"  
[45,] "Hamm"                 "Hafenstr. 45"                "59067 Hamm"                 "02381.903-0"          "02381.903-525"   
[46,] "Hannover"             "Feuerwehrstr. 1"             "30169 Hannover"             "0511.912-0"           "0511.912-1500"   
[47,] "Heidelberg"           "Baumschulenweg 4"            "69124 Heidelberg"           "06221.58-21100"       "06221.984190"    
[48,] "Heilbronn"            "Beethovenstr. 29"            "74074 Heilbronn"            "07131.56-2100"        "07131.56-3607"   
[49,] "Herne"                "Sodinger Str. 9"             "44623 Herne"                "02323.16-5211"        "02323.16-2970"   
[50,] "Herten"               "An der Feuerwache 7-9"       "45699 Herten"               "02366.31024"          "02366.39258"     

[[2]]
      [,1]                                  [,2]                               [,3]                                           [,4]               
 [1,] "E-Mail-Adresse (allgemein)"          "Internet"                         "Leiter der Berufsfeuerwehr"                   "Telefon-Durchwahl"
 [2,] "[email protected]"            "www.feuerwehr-aachen.de"          "Ltd. BD Dipl.-Ing. Jürgen Wolff"              "0241.43237-0001"  
 [3,] "[email protected]"        "www.feuerwehr-altenburg.eu"       "BAR Ing. Thomas Wust"                         "03447.594-350"    
 [4,] "[email protected]"               "www.feuerwehr-augsburg.de"        "Ltd. BD Dipl.-Chem. Frank Habermaier"         "0821.324-37000"   
 [5,] "[email protected]"                "www.feuerwehr-bautzen.de"         "BOI Markus Bergander"                         "03591.6798-99"    
 [6,] "[email protected]"  "www.berliner-feuerwehr.de"        "LBD Dipl.-Ing. Wilfried Gräfling"             "030.387-10900"    
 [7,] "[email protected]"              "www.feuerwehr-bielefeld.de"       "Ltd. BD Dipl.-Chem. Rainer Kleibrink"         "0521.512294"      
 [8,] "[email protected]"                 "www.bochum.de/feuerwehr"          "Dir. v. FW + RD Dr.-Ing. Dirk Hagebölling"    "0234.9254-500"    
 [9,] "[email protected]"                   "www.bonn.de"                      "Ltd. Städt. BD Dipl.-Ing. Jochen Stein"       "0228.717-762"     
[10,] "[email protected]"               "www.bottrop.de"                   "BD Dipl.-Ing. Kim Heimann"                    "02041.7803-102"   
[11,] "[email protected]"      "www.stadt-brandenburg.de"         "BD Ing. Detlef Wolf"                          "03381.623100"     
[12,] "[email protected]"           "www.feuerwehr.braunschweig.de"    "Ltd. BD Dipl.-Ing. Michael Hanne"             "0531.2345-201"    
[13,] "[email protected]"          "www.feuerwehr-bremen.org"         "Ltd. BD Dipl.-Phys. Karl-Heinz Knorr"         "0421.3030-11500"  
[14,] "[email protected]"  "www.feuerwehr.bremerhaven.de"     "Ltd. BD Dipl.-Wirtsch.-Ing. Jens Cordes"      "0471.590-1200"    
[15,] "[email protected]"          "www.chemnitz.de"                  "Ltd. BD Dipl.-Ing. Bernd Marschner"           "0371.488-3701"    
[16,] "[email protected]"           "www.feuerwehr.cottbus.de"         "BD Jörg Specht"                               "0355.632-113"     
[17,] "[email protected]"         "www.feuerwehr.cuxhaven.de"        "BOAR Dipl.-Ing. (FH) Thomas Gillert"          "04721.728-321"    
[18,] "[email protected]"              "www.feuerwehr-darmstadt.de"       "Ltd. BD Dipl.-Ing. Johann Georg Braxenthaler" "06151.780-1000"   
[19,] "[email protected]"            "www.feuerwehr-delmenhorst.de"     "BOAR Thomas Simon"                            "04221.99-1133"    
[20,] "[email protected]"       "www.dessau-rosslau.de"            "BR Lutz Kuhnhold"                             "0340.204-1037"    
[21,] "[email protected]"                "www.feuerwehr.dortmund.de"        "Dir. d. Fw Dipl.-Ing. Dirk Aschenbrenner"     "0231.845-6000"    
[22,] "[email protected]"                "www.dresden.de/feuerwehr"         "Ltd. Stadtdirektor Ing. Andreas Rümpel"       "0351.8155250"     
[23,] "[email protected]"          "www.duisburg.de"                  "BD Dipl.-Ing. Oliver Tittmann"                "0203.308-2000"    
[24,] "[email protected]"            "www.feuerwehr-duesseldorf.de"     "Dir. d. Fw Dipl.-Phys. Peter Albers"          "0211.3889-102"    
[25,] "[email protected]"             "www.feuerwehr-eberswalde.de"      "StBR Dipl.-Ing. (FH) Nikolaus Meier"          "03334.8191812"    
[26,] "[email protected]"          "www.eisenach.de"                  "BA Jens Claus"                                "03691.673370"     
[27,] "[email protected]"                 "www.erfurt.de/feuerwehr"          "Ltd. BD Dipl.-Ing. (TU) Tobias Bauer"         "0361.741-5000"    
[28,] "[email protected]"             "www.feuerwehr-essen.com"          "Dir. d. Fw Dipl.-Ing. Ulrich Bogdahn"         "0201.123-7000"    
[29,] "[email protected]"        "www.berufsfeuerwehr.flensburg.de" "OBR Carsten Herzog"                           "0461.851120"      
[30,] "[email protected]" "www.feuerwehr-frankfurt.de"       "Dir. d. BD Prof. Dipl.-Ing. Reinhard Ries"    "069.212-720010"   
[31,] "[email protected]"       "www.frankfurt-oder.de"            "BD Dipl.-Ing. (FH) Helmut Otto"               "0335.5653701"     
[32,] "[email protected]"         "www.freiburg.de"                  "Ltd. BD Dipl.-Ing. Ralf-Jörg Hohloch"         "0761.201-3300"    
[33,] "[email protected]"           "www.feuerwehr-fuerth.org"         "BOR Dipl.-Ing. (FH) Christian Gußner"         "0911.974-3613"    
[34,] "[email protected]"          "www.feuerwehr-gelsenkirchen.de"   "Ltd. BD Dipl.-Chem. Michael Axinger"          "0209.1704-200"    
[35,] "[email protected]"                   "www.feuerwehr-gera.de"            "BR Axel Schuh"                                "0365.838-2600"    
[36,] "[email protected]"           "www.feuerwehr.giessen.de"         "BORin Dipl.-Ing. Martina Klee"                "0641.306-3701"    
[37,] "[email protected]"               "www.feuerwehr.goerlitz.de"        "BR Dipl.-Ing. oec. Uwe Restetzki"             "03581.486412"     
[38,] "[email protected]"             "www.feuerwehr.goettingen.de"      "BD Dr. rer. nat. Martin Schäfer"              "0551.7075214"     
[39,] "[email protected]"             "www.feuerwehr-gotha.de"           "BA Andreas Ritter"                            "03621.222560"     
[40,] "[email protected]"             "www.greifswald.de"                "BR Mathias Herenz"                            "03834.8536-2600"  
[41,] "[email protected]"        "www.feuerwehr-guetersloh.de"      "BD Dipl.-Ing. Joachim Koch"                   "05241.82-2003"    
[42,] "[email protected]"            "www.hagen.de"                     "komm. OBR Dipl.-Ing. Veit Lenke"              "02331.374-1100"   
[43,] "[email protected]"                  "www.feuerwehr-halle.de"           "BR Dr.-Ing. Robert Pulz"                      "0345.221-5230"    
[44,] "[email protected]"     "www.feuerwehr.hamburg.de"         "OBD Dipl.-Ing. Klaus Maurer"                  "040.42851-4001"   
[45,] "[email protected]"             "www.feuerwehr-hamm.de"            "Ltd. BD Dipl.-Ing. Wilhelm Tigges"            "02381.903-100"    
[46,] "[email protected]"         "www.feuerwehr-hannover.de"        "Dir. d. Fw Dipl.-Chem. Claus Lange"           "0511.912-1200"    
[47,] "[email protected]"       "www.feuerwehr-heidelberg.de"      "StBD Dr. Georg Belge"                         "06221.58-21000"   
[48,] "[email protected]"        "www.feuerwehr-heilbronn.de"       "BD Eberhard Jochim"                           "07131.56-2101"    
[49,] "[email protected]"                 "www.berufsfeuerwehr.herne.de"     "Ltd. BD Dipl.-Ing. Andreas Spahlinger"        "02323.16-5221"    
[50,] "[email protected]"             "www.herten.de"                    "BR Stefan Lammering"                          "02366.307708"     

[[3]]
      [,1]              [,2]                          [,3]                        [,4]                   [,5]              
 [1,] "Berufsfeuerwehr" "Straße"                      "PLZ/Ort"                   "Telefon-Nr. Zentrale" "Fax-Nr. Zentrale"
 [2,] "Hildesheim"      "An der Feuerwache 4-7"       "31135 Hildesheim"          "05121.3012222"        "05121.12600"     
 [3,] "Hoyerswerda"     "Liselotte-Hermann-Str. 89 a" "02977 Hoyerswerda"         "03571.457360"         "03571.457355"    
 [4,] "Ingolstadt"      "Dreizehnerstr. 1"            "85049 Ingolstadt"          "0841.305-3939"        "0841.305-3999"   
 [5,] "Iserlohn"        "Dortmunder Str. 112"         "58638 Iserlohn"            "02371.806-6"          "02371.806-806"   
 [6,] "Jena"            "Am Anger 28"                 "07743 Jena"                "03641.4040"           "03641.442811"    
 [7,] "Kaiserslautern"  "An der Feuerwache 6"         "67663 Kaiserslautern"      "0631.316052-0"        "–"               
 [8,] "Karlsruhe"       "Ritterstr. 48"               "76137 Karlsruhe"           "0721.133-3750"        "0721.133-3709"   
 [9,] "Kassel"          "Wolfhager Str. 25"           "34117 Kassel"              "0561.7884-0"          "0561.7884-189"   
[10,] "Kiel"            "Westring 325"                "24116 Kiel"                "0431.5905-0"          "0431.5905-147"   
[11,] "Koblenz"         "Schlachthofstr. 2-12"        "56073 Koblenz"             "0261.404040"          "0261.44660"      
[12,] "Köln"            "Scheibenstr. 13"             "50737 Köln"                "0221.9748-0"          "0221.9748-1270"  
[13,] "Krefeld"         "Zur Feuerwache 4"            "47805 Krefeld"             "02151.8213-0"         "02151.8213-300"  
[14,] "Leipzig"         "Goerdelerring 7"             "04109 Leipzig"             "0341.123-0"           "–"               
[15,] "Leverkusen"      "Stixchesstr. 162"            "51371 Leverkusen"          "0214.7505-0"          "0214.7505-381"   
[16,] "Lübeck"          "Bornhövedstr. 10"            "23554 Lübeck"              "0451.122-3800"        "0451.122-3789"   
[17,] "Ludwigshafen"    "Kaiserwörthdamm 1"           "67065 Ludwigshafen"        "0621.504-6110"        "0621.504-6100"   
[18,] "Lünen"           "Kupferstr. 60"               "44532 Lünen"               "02306.767-0"          "02306.767-333"   
[19,] "Magdeburg"       "Peter-Paul-Str. 12"          "39106 Magdeburg"           "0391.54010"           "0391.540-1180"   
[20,] "Mainz"           "Jakob-Leischner-Str. 11"     "55128 Mainz"               "06131.124580"         "06131.124583"    
[21,] "Mannheim"        "Meerfeldstr. 1-5"            "68163 Mannheim"            "0621.32888-0"         "0621.32888-113"  
[22,] "Minden"          "Marienstr. 75"               "32425 Minden"              "0571.8387-0"          "0571.8387-200"   
[23,] "Mönchengladbach" "Stockholtweg 132"            "41238 Mönchengladbach"     "02166.9989-0"         "02166.9989-2114" 
[24,] "Mülheim/Ruhr"    "Zur Alten Dreherei 11"       "45479 Mülheim an der Ruhr" "0208.455-92"          "0208.455-3799"   
[25,] "München"         "An der Hauptfeuerwache 8"    "80331 München"             "089.2353-001"         "089.2353-3182"   
[26,] "Münster"         "York-Ring 25"                "48157 Münster"             "0251.2025-0"          "0251.2025-8010"  
[27,] "Neubrandenburg"  "Ziegelbergstr. 50"           "17033 Neubrandenburg"      "0395.5551522"         "0395.5551555"    
[28,] "Neumünster"      "Färberstr. 105-107"          "24534 Neumünster"          "04321.3322-0"         "04321.3322-191"  
[29,] "Nordhausen"      "Hohekreuzstr. 1"             "99734 Nordhausen"          "03631.61900"          "03631.902375"    
[30,] "Nürnberg"        "Regenstr. 4"                 "90451 Nürnberg"            "0911.231-6400"        "0911.231-6405"   
[31,] "Oberhausen"      "Brücktorstr. 30"             "46047 Oberhausen"          "0208.8585-1"          "0208.8585-228"   
[32,] "Offenbach"       "Rhönstr. 10"                 "63071 Offenbach am Main"   "069.8237990"          "069.8065-3337"   
[33,] "Oldenburg"       "Ibo-Koch-Str. 6"             "26127 Oldenburg"           "0441.235-0"           "–"               
[34,] "Osnabrück"       "Nobbenburger Str. 4"         "49076 Osnabrück"           "0541.327-5112"        "0541.323-2720"   
[35,] "Pforzheim"       "Habermehlstr. 77"            "75172 Pforzheim"           "07231.39-2511"        "07231.39-1517"   
[36,] "Plauen"          "Poeppigstr. 8"               "08529 Plauen"              "03741.484130"         "03741.484110"    
[37,] "Potsdam"         "Holzmarktstr. 6"             "14467 Potsdam"             "0331.37010"           "0331.294195"     
[38,] "Ratingen"        "Voisweg 1-5"                 "40878 Ratingen"            "02102.550-37777"      "02102.550-37902" 
[39,] "Regensburg"      "Greflinger Str. 20"          "93055 Regensburg"          "0941.507-1365"        "0941.507-4369"   
[40,] "Remscheid"       "Auf dem Knapp 23"            "42855 Remscheid"           "02191.16-2400"        "02191.386921"    
[41,] "Reutlingen"      "Hauffstr. 57"                "72762 Reutlingen"          "07121.303-1600"       "07121.303-1788"  
[42,] "Rostock"         "Erich-Schlesinger-Str. 24"   "18059 Rostock"             "0381.381-3711"        "0381.381-3760"   
[43,] "Saarbrücken"     "Hessenweg 7"                 "66111 Saarbrücken"         "0681.3010-0"          "0681.3010-219"   
[44,] "Salzgitter"      "An der Feuerwache 3"         "38226 Salzgitter"          "05341.837-0"          "05341.837-2804"  
[45,] "Schwerin"        "Graf-Yorck-Str. 21"          "19061 Schwerin"            "0385.5000-0"          "0385.5000-117"   
[46,] "Solingen"        "Katternberger Str. 44-46"    "42655 Solingen"            "0212.2202-0"          "0212.2202-149"   
[47,] "Stralsund"       "Fährwall 18"                 "18439 Stralsund"           "03831.253-813"        "03831.253-812"   
[48,] "Stuttgart"       "Mercedesstr. 35"             "70372 Stuttgart"           "0711.5066-0"          "0711.5066-7399"  
[49,] "Trier"           "St.-Barbara-Ufer 40"         "54290 Trier"               "0651.9488-0"          "0651.9488-252"   
[50,] "Weimar"          "Kromsdorfer Str. 13"         "99427 Weimar"              "03643.555-555"        "03643.555-549"   
[51,] "Wiesbaden"       "Kurt-Schumacher-Ring 16"     "65197 Wiesbaden"           "0611.499-0"           "0611.499-190"    
[52,] "Wilhelmshaven"   "Mozartstr. 11-13"            "26382 Wilhelmshaven"       "04421.9818-0"         "04421.9818-180"  
[53,] "Wismar"          "Frische Grube 13"            "23966 Wismar"              "03841.251-0"          "03841.251-3342"  
[54,] "Witten"          "Dortmunder Str. 17"          "58449 Witten"              "02302.923-0"          "02302.81015"     
[55,] "Wolfsburg"       "Dieselstr. 24"               "38446 Wolfsburg"           "05361.844-0"          "05361.844-4276"  
[56,] "Würzburg"        "Hofstallstr. 3"              "97070 Würzburg"            "0931.30906-0"         "0931.30906-520"  
[57,] "Wuppertal"       "August-Bebel-Str. 55"        "42109 Wuppertal"           "0202.563-1111"        "0202.445331"     
[58,] "Zwickau"         "Crimmitschauer Str. 35"      "08056 Zwickau"             "0375.83-5700"         "0375.215764"     

[[4]]
      [,1]                                     [,2]                                [,3]                                                  [,4]               
 [1,] "E-Mail-Adresse (allgemein)"             "Internet"                          "Leiter der Berufsfeuerwehr"                          "Telefon-Durchwahl"
 [2,] "[email protected]"          "www.feuerwehr-hildesheim.de"       "BOAR Dipl.-Ing. Martin Stenz M. A."                  "05121.3012201"    
 [3,] "[email protected]"         "www.hoyerswerda.de"                "BR Ing. Dieter Kowark"                               "03571.457350"     
 [4,] "[email protected]"                "www.berufsfeuerwehr-Ingolstadt.de" "BR Dipl.-Ing. (FH) Ulrich Braun"                     "0841.305-3900"    
 [5,] "[email protected]"                  "www.feuerwehr-iserlohn.de"         "OBR Dipl.-Ing. Christian Eichhorn"                   "02371.806-700"    
 [6,] "[email protected]"                      "www.jena.de"                       "OBR Dipl.-Ing. (TH) Michael Koch"                    "03641.49-9110"    
 [7,] "[email protected]"       "www.feuerwehr-kaiserslautern.de"   "BD Dipl.-Ing. Konrad Schmitt"                        "0631.316052-1371" 
 [8,] "[email protected]"                        "www.feuerwehr-karlsruhe.de"        "BD Dipl.-Ing. Florian Geldner"                       "0721.133-3700"    
 [9,] "[email protected]"                    "www.feuerwehr-kassel.eu"           "Ltd. BD Dipl.-Ing. Norbert Schmitz"                  "0561.7884-101"    
[10,] "[email protected]"                "www.kiel.de"                       "Ltd. BD Thomas Hinz"                                 "0431.5905-121"    
[11,] "[email protected]"              "www.feuerwehr-koblenz.de"          "Dipl.-Ing. (FH) BAR Meik Maxeiner (ab 1.5.2016)"     "0261.404048831"   
[12,] "[email protected]"               "www.stadt-koeln.de"                "Dir. d. Fw Dipl.-Ing. Johannes Feyrer"               "0221.9748-9999"   
[13,] "[email protected]"                   "www.krefeld.de/feuerwehr"          "Ltd. BD Dipl.-Ing. Dietmar Meißner"                  "02151.8213-200"   
[14,] "[email protected]"                   "www.feuerwehr-leipzig.de"          "Ltd. StD Dipl.-Ing. K.-H. Schneider (bis 31.5.2016)" "0341.123-9500"    
[15,] "[email protected]"    "www.leverkusen.de"                 "Ltd. BD Dipl.-Ing. Hermann Greven"                   "0214.7505-300"    
[16,] "[email protected]"                   "www.feuerwehr.luebeck.de"          "Ltd. BD Dipl.-Ing. Oliver Bäth (bis 31.7.2016)"      "0451.122-3700"    
[17,] "[email protected]"              "www.ludwigshafen.de"               "BD Dipl.-Ing. (FH) Peter Friedrich"                  "0621.504-3037"    
[18,] "[email protected]"         "www.feuerwehr-luenen.de"           "BOAR Rainer Ashoff"                                  "02306.767-223"    
[19,] "[email protected]"                 "www.feuerwehr-magdeburg.de"        "Ltd. BD Dipl.-Ing. Helge Langenhan"                  "0391.540-1110"    
[20,] "[email protected]"               "www.feuerwehr-mainz.org"           "BD Dipl.-Ing. Martin Spehr"                          "06131.124500"     
[21,] "[email protected]"                  "www.mannheim.de/feuerwehr/"        "Stadtdirektor Dipl.-Ing. (FH) Thomas Schmitt"        "0621.32888-100"   
[22,] "[email protected]"                      "www.minden112.de"                  "BR Heino Nordmeyer"                                  "0571.8387-192"    
[23,] "[email protected]"          "www.feuerwehr-mg.de"               "Ltd. BD Dipl-Ing. Jörg Lampe"                        "02166.9989-2121"  
[24,] "[email protected]"             "www.feuerwehr-muelheim.de"         "Ltd. BD Dipl.-Ing. Burkhard Klein"                   "0208.455-3701"    
[25,] "[email protected]"   "www.feuerwehr-muenchen.de"         "OBD Dipl.-Ing. Wolfgang Schäuble"                    "089.2353-3100"    
[26,] "[email protected]"            "www.muenster.de/stadt/feuerwehr"   "Ltd. BD Dipl.-Ing. Benno Fritzen"                    "0251.2025-8000"   
[27,] "[email protected]"            "www.neubrandenburg.de"             "BOAR Frank Bühring"                                  "0395.5551523"     
[28,] "[email protected]"         "www.neumuenster.de/feuerwehr"      "BD Dipl.-Ing. Sven Kasulke"                          "04321.3322-101"   
[29,] "[email protected]"                "www.nordhausen-feuerwehr.de"       "BOAR Ing. Gerd Jung"                                 "03631.619012"     
[30,] "[email protected]"                  "www.nuernberg.de"                  "Stadtdirektor Dipl.-Min. Volker Skrok"               "0911.231-6000"    
[31,] "[email protected]"                "www.oberhausen.de"                 "BD Dipl.-Ing. Gerd Auschrat"                         "0208.8585-200"    
[32,] "[email protected]"                 "www.feuerwehr-offenbach.de"        "Ltd. BD Dipl.-Ing. Uwe Sauer"                        "069.8065-3340"    
[33,] "[email protected]"           "www.oldenburg.de"                  "BD Dipl.-Geol. Michael Bremer"                       "0441.235-4321"    
[34,] "[email protected]"                "www.osnabrueck.de"                 "BD Dipl.-Ing. Dietrich Bettenbrock"                  "0541.323-1288"    

Ogata answered 15/9, 2022 at 22:22 Comment(0)

Here is a different approach that works well on the PDF "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf". The output is shown below the code.

library(RDCOMClient)

path_PDF <- "C:\\ast_sci_data_tables_sample.pdf"
path_Word <- "C:\\Temp.docx"

####################################################################
#### Step 1 : We use the OCR of Word to convert the PDF in word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

doc$SaveAs2(path_Word)

##############################################################
#### Step 2 : We extract the table from the word document ####
##############################################################
nb_Tables <- doc$tables()$count()
list_Table <- list()

for(l in 1 : nb_Tables)
{
  print(l)
  nb_Row <- doc$tables(l)$Rows()$Count()
  nb_Col <- doc$tables(l)$Columns()$Count()
  mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)
  
  for(i in 1 : nb_Row)
  {
    for(j in 1 : nb_Col)
    {
      mat_Temp[i, j] <- tryCatch(doc$tables(l)$cell(i, j)$range()$text(), error = function(e) NA)
    }
  }
  
  list_Table[[l]] <- mat_Temp
}

list_Table

[[1]]
     [,1]                        [,2]                          
[1,] " Number of Coils     \r\a" "    Number of Paperclips\r\a"
[2,] "5 \r\a"                    "3, 5, 4\r\a"                 
[3,] " 10 \r\a"                  "       7, 8, 6\r\a"          
[4,] " 15 \r\a"                  "\t \t11, 10, 12\r\a"           
[5,] " 20 \r\a"                  "\t \t15, 13, 14\r\a"           

[[2]]
      [,1]              [,2]                  [,3]                              [,4]            
 [1,] "Speed (mph)\r\a" "Driver\r\a"          "Car\r\a"                         "Engine\r\a"    
 [2,] "407.447\r\a"     "Craig Breedlove\r\a" "Spirit of America \r\a"          "GE J47\r\a"    
 [3,] "413.199\r\a"     "Tom Green \r\a"      "Wingfoot Express \r\a"           "WE J46  \r\a"  
 [4,] "434.22\r\a"      "Art Arfons\r\a"      "Green Monster \r\a"              "GE J79 \r\a"   
 [5,] "468.719\r\a"     "Craig Breedlove\r\a" "Spirit of America\r\a"           "GE J79 \r\a"   
 [6,] "526.277\r\a"     "Craig Breedlove\r\a" "Spirit of America\r\a"           "GE J79 \r\a"   
 [7,] "536.712\r\a"     "Art Arfons\r\a"      "Green Monster \r\a"              "GE J79  \r\a"  
 [8,] "555.127\r\a"     "Craig Breedlove\r\a" "Spirit of America, Sonic 1 \r\a" "GE J79 \r\a"   
 [9,] "576.553\r\a"     "Art Arfons\r\a"      "Green Monster \r\a"              "GE J79 \r\a"   
[10,] "600.601\r\a"     "Craig Breedlove\r\a" "Spirit of America, Sonic 1\r\a"  "GE J79 \r\a"   
[11,] "622.407\r\a"     "Gary Gabelich\r\a"   "Blue Flame \r\a"                 "Rocket \r\a"   
[12,] "633.468\r\a"     "Richard Noble \r\a"  "Thrust 2 \r\a"                   "RR RG 146 \r\a"
[13,] "763.035\r\a"     "Andy Green\r\a"      "Thrust SSC\r\a"                  "RR Spey\r\a"   
[14,] "\r\a"            "\r\a"                "\r\a"                            NA              
      [,5]            
 [1,] "Date\r\a"      
 [2,] "8/5/63\r\a"    
 [3,] "10/2/64\r\a"   
 [4,] "10/5/64\r\a"   
 [5,] "10/13/64\r\a"  
 [6,] "10/15/65\r\a"  
 [7,] "10/27/65\r\a"  
 [8,] "11/2/65 \r\a"  
 [9,] "11/7/65 \r\a"  
[10,] "11/15/65 \r\a" 
[11,] "10/23/70  \r\a"
[12,] "10/4/83  \r\a" 
[13,] "10/15/97\r\a"  
[14,] NA              

[[3]]
     [,1]                             [,2]                       
[1,] "  Time (drops of water)   \r\a" "        Distance (cm)\r\a"
[2,] "\t \t1 \r\a"                      " 10,11,9\r\a"             
[3,] "\t \t2 \r\a"                      " 29, 31, 30\r\a"          
[4,] "\t \t3 \r\a"                      " 59, 58, 61\r\a"          
[5,] "\t \t4 \r\a"                      " 102, 100, 98\r\a"        
[6,] "\t \t5 \r\a"                      " 122, 125, 127 \r\a"   
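
The cells above still carry Word's end-of-cell markers (`\r\a`) plus stray tabs and spaces. A small sketch of how they could be stripped follows. `clean_cell` is a hypothetical helper, and the two sample cells are copied from the first table above.

```r
## two cells copied from the output above
cells <- c(" Number of Coils     \r\a", "\t \t11, 10, 12\r\a")

## hypothetical helper: drop control characters, then trim spaces
clean_cell <- function(x) {
  x <- gsub("[[:cntrl:]]", "", x)  # removes \r, \a and \t
  trimws(x)                        # removes leading/trailing whitespace
}

clean_cell(cells)
```

To clean a whole matrix while keeping its dimensions, assign into it with empty brackets, for example `list_Table[[2]][] <- clean_cell(list_Table[[2]])`. `NA` cells stay `NA`.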
Ogata answered 15/9, 2022 at 21:54 Comment(0)
