How to extract text from a PDF? [closed]
Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.

We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but would like to hear other people's experiences and suggestions.

Are there alternatives (commercial or free) for extracting text from a PDF programmatically?

Monecious answered 6/9, 2010 at 11:11 Comment(4)
Related question: Extract Images and Words with coordinates and sizes from PDF – Duntson
For those needing something really simple (no position info), this Perl regex may suffice: /^\s*\[?\((.*?)\)\]?\s*T[Jj]/mg. It just looks for the Tj/TJ operator, which denotes all normal text in a PDF. – Bleeder
Use the TomRoush PdfBox library; it works well on Android. – Fleer
Library recommendations are off topic for Stack Overflow. Such questions could be on topic on softwarerecs.stackexchange.com. Before asking there, please read their help center and asking guidance. – Bandage
I was given a 400-page PDF file with a table of data that I had to import – luckily no images. Ghostscript worked for me:

gswin64c -sDEVICE=txtwrite -o output.txt input.pdf

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc., and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
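
For the cleanup step, if you are on Linux or Cygwin (as some commenters are), a one-liner can do most of the work. This is only a sketch: it assumes the txtwrite output landed in output.txt, and the second pattern ('^Page ') is just a stand-in for whatever page header your file actually repeats:

grep -v -e '^[[:space:]]*$' -e '^Page ' output.txt > records.txt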

Comanche answered 16/10, 2014 at 13:6 Comment(6)
On Linux and Cygwin the command is gs instead of gswin64c. Works perfectly. No patented paid crap. It just works. – Fasano
Yup, works great! Now I can use "grep" with impunity on my PDF files. Since I can grep better than I can read, it's a win! (:-) Upvote. – Paratrooper
The only problem I had with this was using it on PDFs with embedded 'old' fonts. Works perfectly for locally generated PDFs, but harder with obscure sources. Otherwise, an excellent scriptlet. – Paregmenon
What does -sDEVICE=txtwrite do? I don't understand much after reading How to Use Ghostscript | Selecting an output device. – Foamflower
For stdout output instead of saving as a text file, use gswin64c -sDEVICE=txtwrite -o- input.pdf. Source (slightly changed by me): gist.github.com/drmohundro/560d72ed06baaf16f191ee8be34526ac – Sydney
Thanks so much! I've been struggling to read a PDF with a table in it for almost three days, and this plus a simple C# script solved it for me. Did a better job than Word, Adobe Acrobat DC, Tabula, or any other tools I've tried. Real life saver. – Reckoner
An efficient command-line tool, open source and free of charge, available on both Linux and Windows: it is simply named pdftotext. This tool is part of the Xpdf suite.

http://en.wikipedia.org/wiki/Pdftotext
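
A typical invocation looks like this (a sketch with placeholder file names; the -layout switch, mentioned in the comments below, preserves the original column and table layout):

pdftotext -layout input.pdf output.txt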

Ekaterinodar answered 13/8, 2014 at 20:47 Comment(5)
On a side note: use the -layout switch to preserve tables; works pretty well. – Jennet
Yes, PDFToText works surprisingly well. Nothing's perfect, but this is the best of the bunch I tried. I like that it has several different algorithms you can pick from. Some algorithms work better with tables, others work better for multi-column text, some preserve spaces and some trim spaces, etc. It's also surprisingly fast. I had a massive 1200-page PDF and it extracted the text in a matter of seconds, about 5-10x faster than Ghostscript. – Lagrange
Official website is xpdfreader.com – Lagrange
Great! Much better results than with gs. pdftotext merges hyphenated words at line ends (except when the -layout option is active) and keeps some inter-word spaces which, strangely, gs merges. – Run
2023 note: This is indeed the optimal solution. If you're on macOS you're looking for brew install poppler to get pdftotext and other utils. – Prosthodontics
As of today I know it: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything the OP wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.

pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. A third incarnation is the PDFlib TET iFilter, a standalone tool for user desktops. Both of these are free (as in beer) for private, non-commercial use.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spat out only garbage.

I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. The tool handled some of my "problematic" PDF test files to my full satisfaction.

From now on this will be my recommendation for every sophisticated and challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

Panchito answered 15/9, 2010 at 23:25 Comment(8)
There is no trial version, and $440 is a bit much to "Give it a try." – Radu
@Darthenius: You must have missed this sentence: "PDFlib TET can be evaluated without a license, but will only process PDF documents with up to 10 pages and 1 MB size unless a valid license key is applied". – Panchito
I tested it; it doesn't recognize columns. I scanned an English tabloid front page. The text was split into 3 columns on the paper, but this plugin mixed the sentences together, making it look like gibberish. Ghostscript, which is free, had the exact same output. – Ringnecked
@RedHotScalability: BTW, you may have more luck with this answer, pdftotext section. But I insist you add the -layout param... – Panchito
@RedHotScalability: Also BTW, TET does recognize columns if used with the correct parameters. But I leave it as an exercise to the ambitious JS scripter to read the documentation and find out how... – Panchito
Thanks @Kurt. My current use case is being able to recognise text regions, like acknowledgements, references, etc. Do you have any advice about how to go about that? – Dunant
Just compared the results from TET, Xpdf pdftotext and Ghostscript. The PDF file had Latin and Cyrillic script, and a multi-column layout. Xpdf pdftotext was the best, then Ghostscript, and the worst was TET. – Partan
@Kurt Pfeifle xpdf-tools-win-4.01, Ghostscript 9.26, TET 5.1. Ended up using Apache Tika 1.20. – Partan
For Python, there are PDFMiner and pyPDF2. For more information on these, see Python module for converting PDF to text.
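
As a minimal command-line sketch (assuming the pdfminer.six fork, which ships the pdf2txt.py tool; file names are placeholders), you can get plain text or XML from a PDF like this:

pdf2txt.py -o output.txt input.pdf
pdf2txt.py -t xml -o output.xml input.pdf

The XML output includes bounding-box coordinates for each text element, which is close to the positional information the question asks for.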

Dirkdirks answered 9/3, 2013 at 17:34 Comment(0)
Here is my suggestion. If you want to extract text from a PDF, you could import the PDF file into Google Docs, then export it to a more friendly format such as .html, .odf, .rtf, .txt, etc. All of this can be done using the Drive API. It is free* and robust. Take a look at:

https://developers.google.com/drive/v2/reference/files/insert https://developers.google.com/drive/v2/reference/files/get

Because it is a REST API, it is compatible with all programming languages. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.

I hope it helps.

Including answered 10/12, 2013 at 17:2 Comment(3)
I've used that option and I wouldn't recommend it. Google's PDF text extraction isn't as good as many alternatives (esp. for non-English) and it is also very very slow. – More
I just tested this in the standard Google Docs UI, and I was actually surprised at how well this did. It correctly parsed a document with multiple text columns, and was the only tool I tried that removed line returns where it thought the text was the continuation of a single paragraph, but kept line returns in other places. It didn't get this perfectly right, and needed some manual refinement, but it appears to be better than most other tools that just force line returns at the end of every line in a PDF. – Lagrange
I have deep privacy concerns about Google: what exactly they do with the data/files you upload to their drive. As a general rule, I always prefer offline methods: gs or pdftotext. It seems to me a waste of energy and resources to upload your file to Google, use their servers, and then download the result. My opinion. – Run
PdfTextStream (which you said you have been looking at) is now free for single-threaded applications. In my opinion its quality is much better than that of other libraries (especially for things like funky embedded fonts).

It is available in Java and C#.

Alternatively, have a look at Apache PDFBox, which is open source.

Possessive answered 16/9, 2012 at 20:22 Comment(6)
PdfTextStream is not supported on Android. Are there any good libraries like this available for Android? – Fleer
@Fleer what about PDFBox? – Possessive
Yes, PdfBox is also not supported on Android... both PdfTextStream and PdfBox use some AWT parts which are not supported on Android. – Fleer
I am using this library, which works well on Android: github.com/TomRoush/PdfBox-Android – Fleer
PdfTextStream is available for C# and Java only. – Lagrange
@SimonEast you could wrap it as a Java service and call it from your other language... – Possessive
One of the other answers here uses gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:

gs \
 -q \
 -dNODISPLAY \
 -dSAFER \
 -dDELAYBIND \
 -dWRITESYSTEMDICT \
 -dSIMPLE \
 -f ps2ascii.ps \
 "${input}" \
 -dQUIET \
 -c quit

I used -dSIMPLE instead of -dCOMPLEX because the latter outputs one character per line.

Wipe answered 25/2, 2014 at 17:19 Comment(0)
The Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.

Docotic.Pdf can be used to extract images from PDFs, too.

Disclaimer: I work for Bit Miracle.

Loomis answered 15/4, 2011 at 15:14 Comment(0)
As the question is specifically about alternative tools to get data from PDF as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: extracting text from PDF as XML along with positioning data (x, y) and font information:

Text in the source PDF:

Products | Units | Price 

Output XML:

<row>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
  </column>
</row>

P.S.: additionally, it also breaks the text into a table-based structure.

Disclosure: I work for ByteScout

Thermo answered 10/2, 2015 at 14:46 Comment(0)
The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v.8.71) and the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dCOMPLEX ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   input.pdf ^
   -dQUIET ^
   -c quit

This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional info mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace -dCOMPLEX with -dSIMPLE.

Panchito answered 7/9, 2010 at 0:13 Comment(4)
As you would guess, this only outputs ASCII text. While free, it is not a great option for software that you plan to use with languages other than English. – Sundowner
@userx: As you could guess, this is Free software, so the source code is available. It is possible to extend it to support non-ASCII... – Panchito
@userx: today I discovered 'TET', the Text Extraction Toolkit from pdflib.com. See my other answer. – Panchito
ps2ascii from Ghostscript 9.07 worked beautifully on my OpenBSD system. I just converted a 526-page PDF to plain text. Now I can easily grep and extract text for notes. I used the simple command ps2ascii book.pdf notes.txt. If your document is predominantly ASCII, you're in luck. – Ophicleide
I know that this topic is quite old, but this need is still alive. I read many documents, forums and scripts and built a new, more advanced one which supports both compressed and uncompressed PDFs:

https://gist.github.com/smalot/6183152

In some cases the command line is forbidden for security reasons, so a native PHP class can fit many needs.

Hope it helps everyone.

Unread answered 8/8, 2013 at 10:4 Comment(0)
For image extraction, pdfimages is a free command-line tool for Linux or Windows (win32):

pdfimages: Extract and Save Images From A Portable Document Format ( PDF ) File
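
A typical call might look like this (a sketch with placeholder names: -j saves images that are stored in JPEG format as .jpg files instead of converting them, and img is just a prefix for the extracted image files):

pdfimages -j input.pdf img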

Pewit answered 18/2, 2013 at 22:45 Comment(0)
Apache PDFBox has this feature – the text part is described in:

http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html

For an example implementation, see https://github.com/WolfgangFahl/pdfindexer

The test case TestPdfIndexer.testExtracting shows how it works.
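
If you just want the text without writing any Java, PDFBox also ships a standalone command-line app. A sketch (the jar name below is a placeholder for whichever 2.x release you downloaded; the subcommand syntax changed in PDFBox 3.x, so check the current docs):

java -jar pdfbox-app-2.0.27.jar ExtractText input.pdf output.txt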

Adigranth answered 7/3, 2014 at 13:53 Comment(0)
QuickPDF seems to be a capable library that should do what you want for a reasonable price.

http://www.quickpdflibrary.com/ - They have a 30 day trial.

Partly answered 7/9, 2010 at 14:46 Comment(0)
On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to "Adobe Reader.app". All I do is drop a PDF file on the alias, which makes it the active document in Adobe Reader; then from the File menu I choose "Save as Text...", give it a name and a place to save it, click "Save", and I'm done.

Surroundings answered 12/1, 2015 at 5:24 Comment(1)
The OP was looking for a way to extract text from a PDF programmatically. Your answer proposes a manual routine instead. – Palpitate
