What is the best way to extract data from a PDF?
I have thousands of PDF files that I need to extract data from. This is an example PDF, and I want to extract the following information from it.

[screenshot of the sample PDF showing the procurement history to extract]

I am open to Node.js, Python, or any other effective method. I have little knowledge of Python and Node.js. I attempted it in Python with this code:

import PyPDF2

try:
    # Open the PDF in binary mode and build a reader for it
    pdfFileObj = open('test.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageNumber = pdfReader.numPages
    page = pdfReader.getPage(0)
    print(pageNumber)

    # Dump the raw text of the first page
    pagecontent = page.extractText()
    print(pagecontent)
except Exception as e:
    print(e)

but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the PDF?

Chancellor answered 14/9, 2019 at 21:42 Comment(2)
There are commercial solutions that can do this; in fact, we have a template that processes these exact documents. Most freely available applications fail at setting markers and recognizing where things may occur in a flowing document. If you would like an answer that is a commercial application, I would be happy to post it.Apfelstadt
@KevinBrown please post it.Chancellor
pdfplumber is the best option (see https://github.com/jsvine/pdfplumber).

Installation

pip install pdfplumber

Extract all the text

import pdfplumber

path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
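
Since the procurement history in the question is a table, pdfplumber's table extraction is worth trying as well. A minimal sketch, reusing the same placeholder path; extract_tables() returns each table on a page as a list of rows:

import pdfplumber

path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        # Each table is a list of rows; each row is a list of cell strings
        for table in page.extract_tables():
            for row in table:
                print(row)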
Glutenous answered 24/3, 2021 at 16:49 Comment(0)
I did something similar to scrape my grades a long time ago. The easiest (not pretty) solution I found was to convert the PDF to HTML, then parse the HTML.

To do so I used pdf2txt/pdf2html (https://pypi.org/project/pdf-tools/) and lxml's html module.
I also used codecs, but I don't remember exactly why.

A quick and dirty summary:

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span></div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns a list of elements, let's process it
for elm in extracted_data:
    pass  # do things with each text element here
file.close()

Just check the result of pdf2txt or pdf2html, then with XPath you should be able to extract your information easily.

I hope it helps!

EDIT: commented the code

EDIT2: The following code prints your data:

# Assuming you extracted only page 4 of your document:
# os.system("pdf2html test-page4.pdf > test-page4.html")

from lxml import html
import codecs
import os

file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
    line_elements = elm.split()
    # Just observed that what you need starts with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)
file.close()
Kept answered 14/9, 2019 at 22:3 Comment(4)
could you help explain what is going on in the code? It appears the code is first trying to convert the PDF file to text with os.system?Chancellor
I commented the code, if you have a specific question go ahead. Yes, first converting, then parsing.Kept
I got this error >> FileNotFoundError: [Errno 2] No such file or directory: 'file.html'. It seems like it's not converting the PDF file to HTMLChancellor
Looks like the tool isn't working the same way anymore. I added a working piece of code to my response. The only manual step was to extract page 4 of your PDF to the file called test-page4.pdfKept
OK. I help with the development of this commercial product from opait.com. I took your input PDF and zoned a few areas in it like this:

[screenshot: zoned fields in the sample PDF]

And also the table you have:

[screenshot: the zoned procurement history table]

And in about 2 minutes I can extract this from this one document and from 1000 like it. The screenshot below is the log view: all the blue "links" are the actual extracted data, and they link back into the PDF so you can see where each value came from. That data is exported as CSV, one file for the main properties and one for each table, linked by a record ID (in case one PDF contains 1000 of these documents). The output could also be XML, JSON, or other formats.

Again, I help with the development of this product, but what you ask for can be done. I extracted your entire table as well as all the other fields that would be important.

[screenshot: log view with the extracted fields and table]

Apfelstadt answered 15/9, 2019 at 1:53 Comment(0)
PDFTron, the company I work for, has a fully automated PDF-to-HTML output solution.

You can try it out online here: https://www.pdftron.com/pdf-tools/pdf-table-extraction

Here is a screenshot of the HTML output for the file you provided. The output contains both HTML tables and re-flowable text content in between.

[screenshot: HTML output for the provided file]

The output is standard XHTML, so you can easily parse/manipulate the HTML tables.
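
If you want to consume that HTML programmatically, one generic approach is pandas.read_html, which pulls every <table> element in a page into a DataFrame. A minimal sketch, assuming pandas and lxml are installed and that the converter's output was saved to a hypothetical file output.html:

import pandas as pd

# read_html returns one DataFrame per <table> element found in the file
tables = pd.read_html('output.html')
for df in tables:
    print(df.head())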

Near answered 16/9, 2019 at 22:7 Comment(0)
I work for the company that makes PDFTables. The PDFTables API would help you solve this problem and convert all the PDFs at once. It's a simple web-based API, so it can be called from any programming language. You'll need to create an account at PDFTables.com, then use a script from one of the example languages here: https://pdftables.com/pdf-to-excel-api. Here's an example using Python:

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

# Convert every PDF in the folder; each output is written as '<name>.pdf.xlsx'
for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path, file), file + '.xlsx')

The script looks for all files within a folder that have the extension '.pdf', then converts each one to XLSX format. You can change the output format to '.csv', '.html' or '.xml'. The first 75 pages are free.
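
Once converted, the spreadsheets can be read back into Python for further processing, for example with pandas. A minimal sketch, assuming pandas and an Excel reader such as openpyxl are installed; the file name is the hypothetical output of the loop above:

import pandas as pd

# Load one of the converted workbooks into a DataFrame
df = pd.read_excel('example.pdf.xlsx')
print(df.head())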

Womack answered 19/9, 2019 at 9:53 Comment(0)
That's four lines of script in IntelliGet: it starts capturing two lines after the "CAGE Contract Number" header, outputs each line, and stops at the next blank line.

{ start = IsSubstring("CAGE   Contract Number",Line(-2));  
  end = IsEqual(0, Length(Line(1)));
  { start = 1;
    output = Line(0);
  }
}
Eindhoven answered 12/6, 2021 at 14:8 Comment(0)
