Refine table extracted from pdf - Tabulizer

I'm extracting some tables from a PDF with the tabulizer package in R. Below is the code for one of the tables:

library(tabulizer)

# The URL must be a single string without line breaks or spaces,
# so it is assembled from its query parameters with paste0().
location <- paste0(
  "http://napic.jpph.gov.my/portal/web/guest/main-page?",
  "p_p_id=ViewPublishings_WAR_ViewPublishingsportlet&",
  "p_p_lifecycle=2&",
  "p_p_state=normal&",
  "p_p_mode=view&",
  "p_p_resource_id=fileDownload&",
  "p_p_cacheability=cacheLevelPage&",
  "p_p_col_id=column-2&",
  "p_p_col_pos=1&",
  "p_p_col_count=2&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_publishingId=433&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_action=renderReportPeriodScreen&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_language=&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_pageno=1&",
  "publishingId=4537"
)

out <- extract_tables(location, pages = 3)

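The extracted output has a few quirks: the table is split into two matrices (the header and the body), and some values are not properly delimited, e.g. the review period, the state name and the first figure all end up in the same cell.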

[[1]]
     [,1]       [,2]      [,3]       [,4]       [,5]      [,6]      [,7]      [,8]     [,9]       [,10]    [,11]   [,12]   [,13]     [,14]  
[1,] " Review " "States " "Single  " "2 - 3  "  "Single " "2 - 3 "  "Detach " "Town  " "Cluster " "Low "   "Low "  "Flat " "Condo- " "Total"
[2,] "Period "  ""        "Storey "  "Storey "  "Storey " "Storey " ""        "House " ""         "Cost "  "Cost " ""      "minium/" ""     
[3,] ""         ""        "Terrace " "Terrace " "Semi- "  "Semi- "  ""        ""       ""         "House " "Flat " ""      "Apart-"  ""     
[4,] ""         ""        ""         ""         "Detach " "Detach " ""        ""       ""         ""       ""      ""      "ment"    ""     

[[2]]
      [,1]                               [,2] [,3]         [,4]       [,5]       [,6]       [,7]      [,8]      [,9]       [,10]      [,11]      [,12]      [,13]      
 [1,] "EXISTING STOCK  "                 ""   ""           ""         ""         ""         ""        ""        ""         ""         ""         ""         ""         
 [2,] ""                                 ""   ""           ""         ""         ""         ""        ""        ""         ""         ""         ""         ""         
 [3,] "Q3 2016P WP Kuala Lumpur 21,574 " ""   "66,286 "    "466 "     "5,968 "   "7,098 "   "4,671 "  "4,248 "  "3,786 "   "95,647 "  "50,156 "  "163,119 " "423,019"  
 [4,] "WP Putrajaya 0 "                  ""   "2,102 "     "0 "       "991 "     "203 "     "96 "     "0 "      "0 "       "2,538 "   "0 "       "1,785 "   "7,715"    
 [5,] "WP Labuan 835 "                   ""   "1,044 "     "70 "      "944 "     "5,686 "   "11 "     "0 "      "966 "     "680 "     "1,300 "   "225 "     "11,761"   

The desired output I'm looking for should be close to the original table:

[Image: the original table as it appears in the PDF]

I'm stumped at the moment and would appreciate it if anyone could point me in the right direction. Thanks in advance.

Freakish answered 17/2, 2017 at 9:55

Try:

locate_areas(file, pages = NULL, resolution = 60L, widget = c("shiny",
  "native", "reduced"), copy = FALSE)
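  • see the documentation for how to use this tool (it requires Java)

to find the area that you want to extract. You can then pass that area to extract_tables() through its area argument.

After that you still need to post-process the data to get what you want. As far as I know, this is the only way to do it with tabulizer at the moment. Regards.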
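The workflow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: extract_region() and clean_body() are hypothetical helper names, the column labels in value_cols were joined by hand from the header matrix in the question, and the regular expression for the "Q3 2016P" prefix is a guess based on the three rows shown there. The sample rows are retyped from the question so that the cleanup step runs without the PDF.

```r
# Hypothetical wrapper: extract one hand-picked region of one page.
# `area` is c(top, left, bottom, right), e.g. one element of the list
# returned by locate_areas(). Defined here but not called, since it
# needs tabulizer, Java and the PDF itself.
extract_region <- function(file, page, area) {
  tabulizer::extract_tables(file, pages = page, area = list(area), guess = FALSE)
}

# Rows 3-5 of out[[2]] from the question, retyped for illustration.
# With real output, out[[2]][-(1:2), ] would take their place
# (the first two rows only hold the "EXISTING STOCK" label).
body <- rbind(
  c("Q3 2016P WP Kuala Lumpur 21,574 ", "", "66,286 ", "466 ", "5,968 ", "7,098 ",
    "4,671 ", "4,248 ", "3,786 ", "95,647 ", "50,156 ", "163,119 ", "423,019"),
  c("WP Putrajaya 0 ", "", "2,102 ", "0 ", "991 ", "203 ",
    "96 ", "0 ", "0 ", "2,538 ", "0 ", "1,785 ", "7,715"),
  c("WP Labuan 835 ", "", "1,044 ", "70 ", "944 ", "5,686 ",
    "11 ", "0 ", "966 ", "680 ", "1,300 ", "225 ", "11,761")
)

# Column labels joined by hand from the four header rows in out[[1]]
value_cols <- c("Single Storey Terrace", "2 - 3 Storey Terrace",
                "Single Storey Semi-Detach", "2 - 3 Storey Semi-Detach",
                "Detach", "Town House", "Cluster", "Low Cost House",
                "Low Cost Flat", "Flat", "Condominium/Apartment", "Total")

clean_body <- function(m, value_cols) {
  first <- trimws(m[, 1])
  # Drop a leading review period such as "Q3 2016P" when present
  first <- sub("^Q[1-4] [0-9]{4}[A-Z]? ", "", first)
  # The last space-separated token is the first figure, the rest is the state
  state <- sub(" [0-9,]+$", "", first)
  first_value <- sub("^.* ", "", first)
  # Column 2 is empty in the extracted output, so it is skipped
  values <- cbind(first_value, m[, -(1:2), drop = FALSE])
  # Remove thousands separators and stray spaces, then convert to numbers
  values <- matrix(as.numeric(gsub("[ ,]", "", values)), nrow = nrow(values),
                   dimnames = list(NULL, value_cols))
  data.frame(state = state, values, check.names = FALSE, stringsAsFactors = FALSE)
}

stock <- clean_body(body, value_cols)
```

A quick sanity check on the result is to compare the sum of the eleven category columns with the Total column, since a mis-split cell would normally break that equality.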

Sieracki answered 7/5, 2021 at 2:2

I used a different approach, based on the R package RDCOMClient (which requires Windows with Microsoft Word installed). With the following code I obtained fairly good results, considering that Word's OCR was run on the image of the table rather than on the PDF itself. With the original PDF, I think the result would be even better.

library(RDCOMClient)
library(magick)

################################################
#### Step 1 : We convert the image to a PDF ####
################################################

path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\iqxE9.png"
path_Word <- "C:\\temp.docx"

pdf(path_PDF, height = 12, width = 8)

im <- image_read(path_PNG)
plot(im)

# Horizontal rules between the 18 table rows (helps Word detect a table grid)
abline(h = 7 + (1:18) * 21.87, col = "black")

# Vertical rules at the column boundaries
abline(v = c(100, 287, 360, 460, 530, 610, 675, 745, 820, 890, 960, 1034, 1120),
       col = "black")

dev.off()

####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to Word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

doc$SaveAs2(path_Word)

##############################################################
#### Step 3 : We extract the table from the word document ####
##############################################################

nb_Row <- doc$tables(1)$Rows()$Count()
nb_Col <- doc$tables(1)$Columns()$Count()
mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)

for(i in 1 : nb_Row)
{
  for(j in 1 : nb_Col)
  {
    mat_Temp[i, j] <- tryCatch(doc$tables(1)$cell(i, j)$range()$text(), error = function(e) NA)
  }
}

mat_Temp

      [,1]                                [,2]                             [,3]                         [,4]                                  
 [1,] "Review Period\r\a\r\aEXISTING\r\a" "Single\rStorey\rTerrace\r/\r\a" "2 - 3\rStorey\rTerrace\r\a" "/\rSingle\rStorey\rSemi-\rDetach\r\a"
 [2,] NA                                  NA                               NA                           NA                                    
 [3,] "Q3 2016'\r\a"                      "WP Kuala Lumpur 21,574\r\a"     "66,286\r\a"                 "\r\a"                                
 [4,] "\r\a"                              "WP Putrajaya\r\a"               "0\r\a"                      "2,102\r\a"                           
 [5,] "\r\a"                              "WP Labuan\r\a"                  "835\r\a"                    "1,044\r\a"                           
 [6,] "\r\a"                              "Selangor\r\a"                   "161,045\r\a"                "435,179\r\a"                         
 [7,] "\r\a"                              "Johor\r\a"                      "188,845\r\a"                "172,693\r\a"                         
 [8,] "\r\a"                              "Pulau Pinang\r\a"               "44,228\r\a"                 "68,600\r\a"                          
 [9,] "\r\a"                              "Perak\r\a"                      "134,985\r\a"                "100,084\r\a"                         
[10,] "\r\a"                              "Negeri Sembilan 83,030\r\a"     "42,586\r\a"                 "11,159\r\a"                          
[11,] "\r\a"                              "Melaka\r\a"                     "55,901\r\a"                 "29,311\r\a"                          
[12,] "\r\a"                              "Kedah\r\a"                      "73,264\r\a"                 "24,870\r\a"                          
[13,] "\r\a"                              "Pahang\r\a"                     "62,952\r\a"                 "27,670\r\a"                          
[14,] "\r\a"                              "Terengganu\r\a"                 "17,612\r\a"                 "5,859\r\a"                           
[15,] "\r\a"                              "Kelantan\r\a"                   "23,048\r\a"                 "6,220\r\a"                           
[16,] "\r\a"                              "Perlis\r\a"                     "7,598\r\a"                  "1,446\r\a"                           
[17,] "\r\a"                              "Sabah\r\a"                      "19,704\r\a"                 "44,512\r\a"                          
[18,] "\r\a"                              "Sarawak\r\a"                    "45,952\r\a"                 "58,848\r\a"                          
[19,] "\r\a"                              "\r\a"                           "\r\a"                       "\r\a"                                
[20,] "\r\a"                              "\tMALAYSIA\t940,573\r\a"          "\r\a"                       "183,133\r\a"                         
      [,5]                               [,6]          [,7]              [,8]          [,9]                   [,10]                 [,11]        
 [1,] "2 - 3\rStorey\rSemi-\rDetach\r\a" "Detach\r\a"  "Town\rHouse\r\a" "Cluster\r\a" "Low\rCost\rHouse\r\a" "Low\rCost\rFlat\r\a" "Flat\r\a"   
 [2,] NA                                 NA            NA                NA            NA                     NA                    NA           
 [3,] "5,968\r\a"                        "7,098\r\a"   "4,671\r\a"       "4,248\r\a"   "3,786\r\a"            "95,647\r\a"          "50,156\r\a" 
 [4,] "o\r\a"                            "991\r\a"     "203\r\a"         "96\r\a"      "\r\a"                 "0\r\a"               "2,538\r\a"  
 [5,] "70\r\a"                           "944\r\a"     "5,686\r\a"       "11\r\a"      "\r\a"                 "\r\a"                "680\r\a"    
 [6,] "11,946\r\a"                       "35,852\r\a"  "48,311\r\a"      "18,220\r\a"  "8,846\r\a"            "88,193\r\a"          "199254\r\a" 
 [7,] "27,965\r\a"                       "19,616\r\a"  "86,282\r\a"      "1,342\r\a"   "5,537\r\a"            "125,975\r\a"         "45,026\r\a" 
 [8,] "9,536\r\a"                        "19,350\r\a"  "8,593\r\a"       "2,952\r\a"   "8,595\r\a"            "15,530\r\a"          "58,167\r\a" 
 [9,] "22,654\r\a"                       "13,104\r\a"  "60,912\r\a"      "1,374\r\a"   "2,231\r\a"            "80,123\r\a"          "8287\r\a"   
[10,] "5,037\r\a"                        "31,458\r\a"  "1,394\r\a"       "1,807\r\a"   "36,087\r\a"           "10,490\r\a"          "6,637\r\a"  
[11,] "7,878\r\a"                        "3,742\r\a"   "15,550\r\a"      "1 ,312\r\a"  "188\r\a"              "31,440\r\a"          "5,829\r\a"  
[12,] "40,931\r\a"                       "13,017\r\a"  "35,928\r\a"      "590\r\a"     "508\r\a"              "89,107\r\a"          "4,338\r\a"  
[13,] "19,118\r\a"                       "5,360\r\a"   "64,585\r\a"      "213\r\a"     "188\r\a"              "46,835\r\a"          "3,830\r\a"  
[14,] "9,090\r\a"                        "2,626\r\a"   "32,679\r\a"      "154\r\a"     "80\r\a"               "18,559\r\a"          "5,999\r\a"  
[15,] "3,188\r\a"                        "1,172\r\a"   "17,263\r\a"      "O\r\a"       "800\r\a"              "9,798\r\a"           "514\r\a"    
[16,] "3,736\r\a"                        "\r\a"        "613\r\a"         "0\r\a"       "42\r\a"               "7,572\r\a"           "1,378\r\a"  
[17,] "3,224\r\a"                        "1 1\r\a"     "7,294\r\a"       "577\r\a"     "804\r\a"              "14,189\r\a"          "23,757\r\a" 
[18,] "12,172\r\a"                       "30,700\r\a"  "13,804\r\a"      "721\r\a"     "11088\r\a"            "29,930\r\a"          "15,380\r\a" 
[19,] "\r\a"                             "\r\a"        "\r\a"            "\r\a"        "\r\a"                 "\r\a"                "\r\a"       
[20,] "170,208\r\a"                      "436,259\r\a" "33,627\r\a"      "34,962\r\a"  "598,090\r\a"          "481,114\r\a"         "380,345\r\a"
      [,12]                        [,13]            [,14]         
 [1,] "Condo numum\rApartment\r\a" "\r\a"           NA            
 [2,] NA                           "Total\r\a"      NA            
 [3,] "163,119\r\a"                "423,019\r\a"    NA            
 [4,] "\r\a"                       "1,785\r\a"      "7,715\r\a"   
 [5,] "1,300\r\a"                  "225\r\a"        "11,761\r\a"  
 [6,] "154,160\r\a"                "220,038\r\a"    "1 ,381\r\a"  
 [7,] "23,191\r\a"                 "31,589\r\a"     "728,061\r\a" 
 [8,] "115,299\r\a"                "53200\r\a"      "404 ,050\r\a"
 [9,] "2,494\r\a"                  "8,695\r\a"      "434,943\r\a" 
[10,] "14575\r\a"                  "244,260\r\a"    NA            
[11,] "6,465\r\a"                  "10,060\r\a"     "167,676\r\a" 
[12,] "978\r\a"                    "1235\r\a"       "284,766\r\a" 
[13,] "3,135\r\a"                  "7,296\r\a"      "241,182\r\a" 
[14,] "826\r\a"                    "794\r\a"        "94,278\r\a"  
[15,] "1,436\r\a"                  "1,538\r\a"      "64977\r,\r\a"
[16,] "\r\a"                       "480\r\a"        "24,094\r\a"  
[17,] "10,991\r\a"                 "34016\r\a"      "170,964\r\a" 
[18,] "2,881\r\a"                  "12,456\r\a"     "223,932\r\a" 
[19,] "\r\a"                       "\r\a"           "\r\a"        
[20,] "561,101\r\a"                "4,906, 722\r\a" NA    
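The cells above still carry Word's end-of-cell markers (\r\a) and a few OCR slips, such as "1 ,312" or the letter "O" in place of a zero. A minimal cleanup sketch follows, assuming a matrix shaped like mat_Temp. The helper names clean_word_cells() and parse_count() are hypothetical, and the small sample matrix is made up from cells shown in the output so that the code runs without Word.

```r
# Strip Word's cell markers and in-cell line breaks; empty cells become NA
clean_word_cells <- function(m) {
  out <- gsub("[\r\a]+", " ", m)        # "\r\a" ends a cell, "\r" breaks a line
  out <- trimws(gsub(" +", " ", out))   # collapse runs of spaces
  out[!is.na(out) & out == ""] <- NA
  dim(out) <- dim(m)                    # make sure the matrix shape is kept
  out
}

# Convert an OCR'd count to a number. Only meant for numeric columns:
# it removes spaces and commas and treats the letters o/O as zeros.
parse_count <- function(x) {
  x <- gsub("[ ,]", "", x)
  x <- gsub("[oO]", "0", x)
  suppressWarnings(as.numeric(x))
}

# Small sample built from cells in the output above (not the full table)
sample_cells <- matrix(c("Selangor\r\a", "161,045\r\a",
                         "Melaka\r\a",   "1 ,312\r\a",
                         "Kelantan\r\a", "O\r\a",
                         "\r\a",         NA),
                       ncol = 2, byrow = TRUE)

cleaned <- clean_word_cells(sample_cells)
states  <- cleaned[, 1]
counts  <- parse_count(cleaned[, 2])
```

Removing spaces inside a number is a guess about what the OCR meant (a cell such as "1 1" could be 11 or something else entirely), so values repaired this way are worth checking against the row and column totals. The cleanup also does not repair rows where the OCR shifted values into a neighbouring column, as in rows 3 to 5 above.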

Molest answered 16/9, 2022 at 22:57
