Reading data from PDF files into R

G

6

51

Is that even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool?

The reports were made in excel and then pdfed, so they have regular structure, but many blank "cells".

Groscr answered 7/2, 2012 at 23:46 Comment(6)

Taking a glance at CRAN, there doesn't appear to be any library that does that. You might be better off using another language that has such libraries (Perl and Python, for example, both have them), grabbing the data that you need, and then writing it to a file that can be read by R. – Isogonic 7/2, 2012 at 23:51

@JackManey Thanks, that's what I thought. There is readPDF in the tm package (text mining), but it isn't exactly user friendly and I think it uses the command line utility pdftotext under the hood anyway. – Groscr 7/2, 2012 at 23:56

You have my sympathies. Maybe some day we'll live in a world where all data is available as data! – Symmetrical 8/2, 2012 at 0:21

@gsk3 (+1) I appreciate the condolences... I spend most of my days wishing that. And since people are paying attention and I didn't look hard enough... (#3852854) confirms my suspicions. – Groscr 8/2, 2012 at 0:29

There is also the grImport package, which can read PDF files, but it is designed to extract vector graphics -- the text will also be there, but perhaps not in a very useable form. – Rafaelita 8/2, 2012 at 0:57

I've never had success with tm::readPDF, but managed a work-around using pdftotext in my R workflow like this: https://mcmap.net/q/355000/-readpdf-tm-package-in-r – Downstate 2/7, 2014 at 18:25

K

23

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.

On top of that, in my sad experience there's no guarantee that apps which create PDF docs all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the doc was built). Be cautious.
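If you want to check up front whether a file has a text layer at all, one rough test is to look for pages whose extracted text comes back empty. A minimal sketch, assuming the page text is a character vector with one string per page (which is what pdftools::pdf_text() returns); the `pages` vector here is made-up stand-in data, not real output:

```r
# Stand-in for the result of pdftools::pdf_text(): one string per page.
# (Hypothetical data for illustration only.)
pages <- c("Report 2011\nTotal   42", "", "  \n ")

# Indices of pages whose extracted text is empty or whitespace only.
# Such pages are likely scanned images and would need OCR.
no_text <- which(!nzchar(trimws(pages)))

print(no_text)
```

An empty result only tells you that some text was found, not that it comes out in a sensible order, so the caution above still applies.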

Probably better to make a couple grad students transcribe the data for you. They're cheap :-)

Kaiser answered 8/2, 2012 at 2:28 Comment(3)

I wish! Some of us don't have grad students to do our bidding. And I'm too low on the totem pole to hire interns (read lackeys). But good advice! – Groscr 8/2, 2012 at 2:44

@CarlWitthoft I'll accept your answer! Particularly the last line. – Groscr 9/2, 2012 at 15:41

Humans are lousy. I know because I am one and I know lots of others. They excel at three things: solving novel problems; creativity (music, arts and literature); and interpersonal emotional support or persuasion. They can not be relied upon to transcribe. – Churchill 29/11, 2013 at 15:19

G

31

So... this gets me close even on a fairly complex table.

Download a sample pdf from bmi pdf

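library(tm)

# Build a reader that runs pdftotext with -layout to preserve column alignment.
# (Newer tm versions reject PdftotextOptions; see the control = list(text = "-layout") form in a later answer.)
pdf <- readPDF(PdftotextOptions = "-layout")

# Read the PDF into a text document
dat <- pdf(elem = list(uri='bmi_tbl.pdf'), language='en', id='id1')

# Turn each run of spaces into a comma, then parse the result as CSV
dat <- gsub(' +', ',', dat)
out <- read.csv(textConnection(dat), header=FALSE)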
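One caveat with replacing every run of spaces by a comma: a single space inside a cell (say, a header like "Weight kg") also becomes a column break. A possible variation is to split only on runs of two or more spaces. This is just a sketch on made-up lines that imitate -layout output, not the actual contents of bmi_tbl.pdf:

```r
# Hypothetical lines, shaped like what pdftotext -layout might return
lines <- c("Height   Weight kg   BMI",
           "1.70     65          22.5",
           "1.80     81          25.0")

# Split on runs of two or more spaces so single spaces inside a cell survive
fields <- strsplit(trimws(lines), " {2,}")

# First line is the header, the rest are data rows
tbl <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(tbl) <- fields[[1]]

# Convert columns that look numeric from character to numeric
tbl[] <- lapply(tbl, type.convert, as.is = TRUE)

print(tbl)
```

This still collapses genuinely blank cells, because a blank cell just looks like a longer run of spaces. For tables with many blanks, reading by fixed column positions (for example with read.fwf) is probably the safer route.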

Groscr answered 8/2, 2012 at 0:43 Comment(3)

I am running into problems which I do not know how to solve. The following line dat <- pdf(elem = list(uri='C:/Users/Farrel/Downloads/bmi_tbl.pdf'), language='en', id='id1') produces the following error:

Error in file(con, "r") : cannot open the connection In addition: Warning message: In file(con, "r") :   cannot open file 'C:\Users\Farrel\AppData\Local\Temp\RtmpegXWQ3\pdfinfo57c9716105': No such file or directory

– Churchill 29/11, 2013 at 15:44

It does not seem to work for me. I want to extract some text from that. Let me know how I can do it. – Assort 23/3, 2015 at 4:48

it shows an error unused argument(PdftotextOptions = "-layout") Calls – Dullard 3/8, 2017 at 9:45

D

13

The current package du jour for getting text out of PDFs is pdftools (successor to Rpoppler, mentioned in another answer). It works great on Linux, Windows and OSX:

install.packages("pdftools")
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
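Since each page comes back as one long string, a typical next step is to break it into lines and keep the ones that look like data. A rough sketch, using a made-up `page` string in place of a real element of `txt`; the "row ends in a number" rule is only an assumption for illustration:

```r
# Stand-in for one element of pdf_text(): a whole page as a single string.
# (Hypothetical content for illustration only.)
page <- "Quarterly report\n\nRegion   Sales\nNorth    120\nSouth     95\n"

# One element per line of the page
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Drop blank lines
lines <- lines[nzchar(trimws(lines))]

# Keep rows that end in a number (assumed here to be the data rows)
data_rows <- grep("[0-9]+ *$", lines, value = TRUE)

print(data_rows)
```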

Downstate answered 7/9, 2016 at 6:42 Comment(4)

I like this package. – Worl 13/9, 2016 at 18:47

You may also find github.com/ropenscilabs/tabulizer useful for extracting data from tables in PDF files – Downstate 13/9, 2016 at 20:44

@Downstate this worked first time for me; great answer. btw: the French phrase is "du jour" which translates to "of the day", not "de jour" meaning "of day". Sorry to be pedantic :-) – Mirianmirielle 15/12, 2016 at 20:50

@Mirianmirielle Merci beaucoup pour votre commentaire ;) – Downstate 16/12, 2016 at 0:17

C

6

You can also (now) use the new (2015-07) Rpoppler package:

Rpoppler::PDF_text(file)

It includes 3 functions (4, really, but one just gets you a pointer to the PDF object):

PDF_fonts: PDF font information
PDF_info: PDF document information
PDF_text: PDF text extraction

(posting as an answer to help new searchers find the package).

Coppola answered 20/10, 2015 at 11:26 Comment(0)

H

4

per zx8754 ... the following works in Win7 with pdftotext.exe in the working directory:

library(tm)
uri = 'bmi_tbl.pdf'
pdf = readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                language = "en", id = "id1")
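If tm gets in the way, calling pdftotext directly from R is another option. The following is only a sketch: the helper names `pdftotext_args` and `pdf_to_lines` are made up for this example, and it assumes the pdftotext executable can be found on the PATH rather than only in the working directory.

```r
# Build the argument vector for: pdftotext -layout <in.pdf> <out.txt>
# (hypothetical helper, not part of any package)
pdftotext_args <- function(pdf_path,
                           txt_path = sub("\\.pdf$", ".txt", pdf_path)) {
  c("-layout", shQuote(pdf_path), shQuote(txt_path))
}

# Run pdftotext on a file and return the extracted text as lines
# (hypothetical helper; needs pdftotext installed)
pdf_to_lines <- function(pdf_path) {
  if (!nzchar(Sys.which("pdftotext"))) stop("pdftotext not found on PATH")
  txt_path <- tempfile(fileext = ".txt")
  status <- system2("pdftotext", pdftotext_args(pdf_path, txt_path))
  if (status != 0) stop("pdftotext failed with status ", status)
  readLines(txt_path)
}

# Show the arguments that would be passed for the sample file
print(pdftotext_args("bmi_tbl.pdf"))
```

With pdftotext installed, `pdf_to_lines("bmi_tbl.pdf")` would give a character vector with one element per line of the layout-preserved text.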

Herbal answered 26/6, 2015 at 13:18 Comment(0)

S

0

Here is another approach that can be used with Acrobat Pro:

library(RDCOMClient)
acrobat_App <- COMCreate("AcroExch.App")
acrobat_PDDoc <- COMCreate("AcroExch.PDDoc")
acrobat_AVDoc <- COMCreate("AcroExch.AVDoc")
acrobat_PageContent <- COMCreate("AcroExch.HiliteList")
objADOStream <- COMCreate("ADODB.Stream")
acrobat_AVDoc$open("C:\\my_PDF.pdf", 1)
av_Doc <- acrobat_App$GetActiveDoc()
pdf_doc <- av_Doc$GetPDDoc()
pdf_doc$GetNumPages()
page_Number <- pdf_doc$AcquirePage(1)
acrobat_PageContent$Add(0, 9000)
sel_Text <- page_Number$CreatePageHilite(acrobat_PageContent)
index <- 0 : (sel_Text$GetNumText() - 1)
vec_Char <- rep("", length(index))

# index starts at 0 but R vectors are 1-based, so shift the position by one
for(i in index)
{
  print(i)
  vec_Char[i + 1] <- sel_Text$GetText(i)
}

Snooperscope answered 14/3 at 20:3 Comment(0)
