Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pypdf, but the generated text had no spaces between words and was of no use.
Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format.
The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
A Python 3 version of the package is also available.
The PDFMiner package has changed since codeape posted.
EDIT (again):
PDFMiner has been updated again in version 20100213
You can check the version you have installed with the following:
>>> import pdfminer
>>> pdfminer.__version__
'20100213'
Here's the updated version (with comments on what I changed/added):
def pdf_to_csv(filename):
    from cStringIO import StringIO  # <-- added so you can copy/paste this to try it
    from pdfminer.converter import LTTextItem, TextConverter
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda: {})
            for child in self.cur_item.objs:
                if isinstance(child, LTTextItem):
                    (_, _, x, y) = child.bbox  # <-- changed
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)  # <-- changed
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8")  # <-- changed
    # because my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)       # <-- changed
    parser.set_document(doc)     # <-- added
    doc.set_parser(parser)       # <-- added
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()
Edit (yet again):
Here is an update for the latest version on PyPI, 20100619p1. In short, I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.
def pdf_to_csv(filename):
    from cStringIO import StringIO
    from pdfminer.converter import LTChar, TextConverter  # <-- changed
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda: {})
            for child in self.cur_item.objs:
                if isinstance(child, LTChar):  # <-- changed
                    (_, _, x, y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  # <-- changed
    # because my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()
EDIT (one more time):
Updated for version 20110515 (thanks to Oeufcoque Penteano!):
def pdf_to_csv(filename):
    from cStringIO import StringIO
    from pdfminer.converter import LTChar, TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda: {})
            for child in self.cur_item._objs:  # <-- changed
                if isinstance(child, LTChar):
                    (_, _, x, y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child._text.encode(self.codec)  # <-- changed
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
    # because my test documents are utf-8 (note: utf-8 is the default codec)

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()
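To call it, just pass a file path and use the returned semicolon-separated text; the file name below is only an example:

# Hypothetical usage of the pdf_to_csv function above (the file name is made up)
csv_text = pdf_to_csv('example.pdf')
print(csv_text)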
Comment: LTTextItem was changed to LTChar in later versions; see unixuser.org/~euske/python/pdfminer/index.html#changes
Comment: For 20110515, adding a dir() line to see what was going on did the trick. On line 19, change for child in self.cur_item.objs: to for child in self.cur_item._objs: (just add an underscore to objs); and on line 23 do the same for text, that is, line[x] = child.text.encode(self.codec) should change to line[x] = child._text.encode(self.codec).
Comment: Updated to 20110515 per your comment.
Comment: With recent releases you can simply do from pdfminer.high_level import extract_text and then text = extract_text('report.pdf').
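To expand on that last comment, here is a minimal sketch using the modern high-level API; it assumes a current pdfminer.six install (pip install pdfminer.six), and the file name is only an example:

# Minimal sketch of the modern pdfminer.six high-level API.
from pdfminer.high_level import extract_text

def pdf_to_text(path):
    # extract_text handles parsing, layout analysis and decoding internally
    return extract_text(path)

if __name__ == '__main__':
    print(pdf_to_text('report.pdf'))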
Since none of these solutions supports the latest version of PDFMiner, I wrote a simple solution that will return the text of a PDF using PDFMiner. This will work for those who are getting import errors with process_pdf:
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def pdfparser(data):
    fp = file(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    data = retstr.getvalue()
    print data

if __name__ == '__main__':
    pdfparser(sys.argv[1])
Below is code that works for Python 3:
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io

def pdfparser(data):
    fp = open(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    data = retstr.getvalue()
    print(data)

if __name__ == '__main__':
    pdfparser(sys.argv[1])
Comment: For python3, besides the obvious parentheses after the print command, one has to replace the file command with open and import StringIO from the io package.
Pdftotext is an open-source program (part of Xpdf) which you could call from Python (not what you asked for, but might be useful). I've used it with no problems. I think Google uses it in Google Desktop.
Comment: Use the -layout option to keep text in the same position as it is in the PDF. Now if only I could figure out how to pipe the contents of a PDF into it.
Comment: pdftotext seems to work very well, but it needs a second argument that is a hyphen if you want to see the results on stdout.
Comment: find . -iname "*.pdf" -exec pdftotext -enc UTF-8 -eol unix -raw {} \; By default the generated files take the original name with the .txt extension.
Comment: import subprocess; subprocess.call("pdftotext ...".split())
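Putting those comments together, here is a rough sketch of driving pdftotext from Python; it assumes Python 3.7+ and the pdftotext binary on the PATH, and the -layout flag and file name are only examples:

import subprocess

def pdftotext_layout(pdf_path):
    # "-" as the output file makes pdftotext write to stdout;
    # -layout tries to preserve the physical layout of the page.
    result = subprocess.run(["pdftotext", "-layout", pdf_path, "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8")

print(pdftotext_layout("example.pdf"))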
pyPdf works fine (assuming that you're working with well-formed PDFs). If all you want is the text (with spaces), you can just do:
import pyPdf

pdf = pyPdf.PdfFileReader(open(filename, "rb"))
for page in pdf.pages:
    print page.extractText()
You can also easily get access to the metadata, image data, and so forth.
A comment in the extractText code notes:
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Whether or not this is a problem depends on what you're doing with the text (e.g. if the order doesn't matter, it's fine, or if the generator adds text to the stream in the order it will be displayed, it's fine). I have pyPdf extraction code in daily use, without any problems.
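pyPdf has since been renamed (pyPdf, then PyPDF2, now pypdf); as a rough sketch, the equivalent with the current pypdf package (assuming pip install pypdf, with an example file name) looks like this:

from pypdf import PdfReader  # current successor to pyPdf

reader = PdfReader("example.pdf")
for page in reader.pages:
    # extract_text() replaces the old extractText() method
    print(page.extract_text())

# Document metadata is available as well
print(reader.metadata)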
You can also quite easily use pdfminer as a library. You have access to the pdf's content model, and can create your own text extraction. I did this to convert pdf contents to semi-colon separated text, using the code below.
The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with ';' characters.
Using this approach, I was able to extract text from a PDF from which no other tool could extract content suitable for further parsing. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.
pdfminer is an invaluable tool for pdf-scraping.
def pdf_to_csv(filename):
    from cStringIO import StringIO  # needed for the in-memory output buffer
    from pdflib.page import TextItem, TextConverter
    from pdflib.pdfparser import PDFDocument, PDFParser
    from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda: {})
            for child in self.cur_item.objs:
                if isinstance(child, TextItem):
                    (_, _, x, y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.text
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, "ascii")

    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(doc, fp)
    doc.initialize('')

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %d\n" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()
UPDATE:
The code above is written against an old version of the API; see my comment below.
Comment: The package is now called pdfminer, not pdflib. I suggest you have a look at the source of pdf2txt.py in the PDFMiner source; the code above was inspired by the old version of that file.
slate is a project that makes it very simple to use PDFMiner from a library:
>>> import slate
>>> with open('example.pdf') as f:
... doc = slate.PDF(f)
...
>>> doc
[..., ..., ...]
>>> doc[1]
'Text from page 2...'
I needed to convert a specific PDF to plain text within a Python module. I used PDFMiner 20110515; after reading through its pdf2txt.py tool I wrote this simple snippet:
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def to_txt(pdf_path):
    input_ = file(pdf_path, 'rb')

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    process_pdf(manager, converter, input_)

    return output.getvalue()
Comment: pdf2txt.py can be found at, for example, C:\Python27\Scripts\pdfminer\tools\pdf2txt.py.
Repurposing the pdf2txt.py code that comes with pdfminer, you can make a function that takes a path to the PDF, and optionally an outtype (txt|html|xml|tag) and opts like the command-line pdf2txt ({'-o': '/path/to/outfile.txt', ...}). By default, you can call:
convert_pdf(path)
A text file will be created, a sibling on the filesystem to the original pdf.
def convert_pdf(path, outtype='txt', opts={}):
    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, TagExtractor
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.cmapdb import CMapDB

    outfile = path[:-3] + outtype
    outdir = '/'.join(path.split('/')[:-1])

    debug = 0
    # input options
    password = ''
    pagenos = set()
    maxpages = 0
    # output options
    codec = 'utf-8'
    pageno = 1
    scale = 1
    showpageno = True
    laparams = LAParams()
    for (k, v) in opts.items():  # opts is a dict, so iterate over its items
        if k == '-d': debug += 1
        elif k == '-p': pagenos.update(int(x) - 1 for x in v.split(','))
        elif k == '-m': maxpages = int(v)
        elif k == '-P': password = v
        elif k == '-o': outfile = v
        elif k == '-n': laparams = None
        elif k == '-A': laparams.all_texts = True
        elif k == '-D': laparams.writing_mode = v
        elif k == '-M': laparams.char_margin = float(v)
        elif k == '-L': laparams.line_margin = float(v)
        elif k == '-W': laparams.word_margin = float(v)
        elif k == '-O': outdir = v
        elif k == '-t': outtype = v
        elif k == '-c': codec = v
        elif k == '-s': scale = float(v)
    #
    CMapDB.debug = debug
    PDFResourceManager.debug = debug
    PDFDocument.debug = debug
    PDFParser.debug = debug
    PDFPageInterpreter.debug = debug
    PDFDevice.debug = debug
    #
    rsrcmgr = PDFResourceManager()
    if not outtype:
        outtype = 'txt'
        if outfile:
            if outfile.endswith('.htm') or outfile.endswith('.html'):
                outtype = 'html'
            elif outfile.endswith('.xml'):
                outtype = 'xml'
            elif outfile.endswith('.tag'):
                outtype = 'tag'
    if outfile:
        outfp = file(outfile, 'w')
    else:
        outfp = sys.stdout
    if outtype == 'txt':
        device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
    elif outtype == 'xml':
        device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, outdir=outdir)
    elif outtype == 'html':
        device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale, laparams=laparams, outdir=outdir)
    elif outtype == 'tag':
        device = TagExtractor(rsrcmgr, outfp, codec=codec)
    else:
        return usage()  # usage() comes from the original pdf2txt.py command-line tool
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password)
    fp.close()
    device.close()
    outfp.close()
    return
PDFMiner gave me perhaps one line [page 1 of 7...] on every page of a PDF file I tried it with.
The best answer I have so far is pdftoipe, or the C++ code it's based on, Xpdf.
See my question for what the output of pdftoipe looks like.
Additionally, there is PDFTextStream, a commercial Java library that can also be used from Python.
I have used pdftohtml with the -xml argument and read the result with subprocess.Popen(); that will give you the x coordinate, y coordinate, width, height, and font of every snippet of text in the PDF. I think this is what 'evince' probably uses too, because the same error messages spew out.
If you need to process columnar data, it gets slightly more complicated, as you have to invent an algorithm that suits your PDF file. The problem is that the programs that make PDF files don't necessarily lay out the text in any logical format. You can try simple sorting algorithms and they work sometimes, but there can be little 'stragglers' and 'strays', pieces of text that don't get put in the order you thought they would. So you have to get creative.
It took me about 5 hours to figure out one for the PDFs I was working on. But it works pretty well now. Good luck.
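As a rough illustration of that idea (not the author's actual code), here is a minimal sketch that runs pdftohtml -xml, parses the snippet coordinates, and sorts them top-to-bottom, then left-to-right; the flags, file name and naive sort are assumptions and will not handle tricky columnar layouts:

import subprocess
import xml.etree.ElementTree as ET

def pdf_text_snippets(pdf_path):
    # pdftohtml -xml -stdout writes an XML description of every text snippet,
    # with top/left/width/height/font attributes, to standard output.
    xml_data = subprocess.run(["pdftohtml", "-xml", "-stdout", pdf_path],
                              capture_output=True, check=True).stdout
    root = ET.fromstring(xml_data)
    for page in root.iter("page"):
        # Naive reading order: sort by vertical position, then horizontal.
        texts = sorted(page.iter("text"),
                       key=lambda t: (int(t.get("top")), int(t.get("left"))))
        yield ["".join(t.itertext()) for t in texts]

for snippets in pdf_text_snippets("example.pdf"):
    print("\n".join(snippets))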
I found that solution today and it works great for me, even rendering PDF pages to PNG images: http://www.swftools.org/gfx_tutorial.html