Rule-based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills.

The file layouts can get complex, though they consist mostly of tables.

I've already read a few dozen articles about the PDF format, how easy it is for our brain to grasp it, and how hard it is for a machine to understand its structure.

I also downloaded a few tools, like Python's pdfminer and some Java tools. Some even have rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.

Adobe also has an online service called ExportPDF, but it can't be customized.

Bottom line: I understand that extracting text from structured PDF files and converting it to XML, for example, requires some level of manual work.

I also found A-PDF Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.

I thought I might even try converting those files to images and running tesseract-ocr on them, but decided to ask for advice here before spending more time on it.

I'd be very grateful if someone with such experience could give me a hint.

Scholar answered 17/4, 2012 at 10:5 Comment(2)
Unless these PDFs are PDF/A-1a conforming, you are in for a lot of work - you will basically have to do OCR. PDF is not the right format for this; try to get the invoices and bills as properly structured XML or as EDIFACT instead. Friendship
Hey I know this is an old post, but try Tabula github.com/jazzido/tabula-extractor Odorous

I've done a lot of PDF extraction, and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; the file could draw 'world' at coord 20 and then draw 'hello' at coord 10. Most PDF creators don't do this, but there's still no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is likely to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font slightly each time.
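To make the ordering problem concrete, here is a minimal sketch in Python of how positioned text can be put back into visual order. It assumes a parser has already given you each text run with its coordinates; the `Fragment` class, the `rebuild_lines` function, and the tolerance values are all hypothetical and would need tuning for real invoices.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned run of text (hypothetical shape, not a library type)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; in PDF coordinates y grows upward


def rebuild_lines(fragments, y_tol=2.0, gap=1.5):
    """Group fragments into visual lines, then join each line left to right."""
    lines = []  # each entry is [reference_y, [fragments on that line]]
    # Larger y is higher on the page, so sort descending to go top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x0)
        parts = [frags[0].text]
        for prev, cur in zip(frags, frags[1:]):
            # A visible horizontal gap is treated as a word break; otherwise
            # the pieces belong to one word (e.g. glyphs drawn one at a time).
            parts.append(" " if cur.x0 - prev.x1 > gap else "")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


# Fragments in "file order": out of sequence and partly split into glyphs.
fragments = [
    Fragment("world", 20, 45, 700),
    Fragment("hello", 10, 18, 700),
    Fragment("42.00", 60, 85, 680.5),
    Fragment("t", 19, 22, 680),
    Fragment("T", 10, 15, 680),
    Fragment("o", 15, 19, 680),
    Fragment("al", 22, 30, 680),
]
print(rebuild_lines(fragments))
```

The same idea extends to tables: once lines are rebuilt, large horizontal gaps within a line are candidate column boundaries.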

That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
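As a rough illustration of what a rule layer can look like once plain text has been recovered, here is a generic Python sketch that maps text to XML fields. This is not LA-PDFText's rule format; the field names, the regular expressions in `RULES`, and the sample text are all assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: field name -> regex with exactly one capturing group.
RULES = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"\bTotal\s*:?\s*([\d.,]+)", re.I),
}


def text_to_xml(text, rules=RULES):
    """Apply each rule to the text and emit the matched fields as XML."""
    root = ET.Element("invoice")
    for field, pattern in rules.items():
        match = pattern.search(text)
        if match:  # fields whose rule does not match are simply left out
            ET.SubElement(root, field).text = match.group(1)
    return ET.tostring(root, encoding="unicode")


# Made-up text standing in for what an extractor might return.
sample = """ACME Ltd.
Invoice No: INV-0042
Date: 2012-04-17
Total: 1,250.00
"""
print(text_to_xml(sample))
```

In practice each invoice layout tends to need its own rule set, which is where the manual work goes.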

Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.

The last one, OCR, makes me cringe. I just can't imagine that jumping through those hoops would get you a better text representation than parsing the PDF. But then again, PDF is a visual format, so maybe it would.

Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.

Sultan answered 17/4, 2012 at 13:35 Comment(0)

It is not clear whether you are looking for a development tool to automate data extraction from bills and invoices, or just a one-time tool (utility) that a non-developer can use.

Anyway, here are some specialized tools, along with the engines they use:

  1. Tabula (open-source, specifically designed to extract data from tables in PDFs. Can export shell scripts for batch processing, runs as a localhost web service, powered by the JRuby Tabula engine)
  2. VietOCR (open-source .NET desktop utility for text extraction from PDFs and images, based on the Tesseract OCR engine)
  3. Bytescout PDF Viewer (freeware, closed-source .NET utility that detects and extracts tables, including from scanned invoices, powered by PDF Extractor SDK)

DISCLAIMER: I work for ByteScout.

Bewhiskered answered 2/3, 2015 at 11:50 Comment(0)
