How can I extract text from a PDF file in Perl?
S

10

24

I am trying to extract text from PDF files using Perl. I have been running pdftotext.exe from the command line (i.e. through Perl's system function) to extract the text, and this method works fine.

The problem is that the PDF files contain symbols such as α and β and other special characters that do not appear in the generated text file. A few extra spaces are also added at random places in the text.

Is there a better and more reliable way to extract text from PDF files, such that the text includes all the symbols like α, β, etc. and exactly matches the text in the PDF (i.e. without extra spaces)?

Salary answered 16/7, 2009 at 11:39 Comment(2)
Hello guys, thanks for the suggestions. I am now using xpdf with the -raw option to extract text from the PDF files, which removes the unwanted spaces. But now we want to convert the PDF files to HTML so that we can extract formatting tags such as bold and italics along with the text. I tried pdf2html for this but did not find it reliable, as tags like sup and sub were missing. We are now using Acrobat Reader to save the PDF files as HTML, which gives us all the formatting tags. Is there a way to drive Acrobat Reader from Perl to save multiple PDF files as HTML files? Thank you.Salary
Acrobat Professional lets you run batch jobs. I realize you'd probably like a free way out, but since you rely so heavily on PDF extraction, a single license would have saved you a lot of time and money by now.Hodosh
A
23

You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From CPAN

   use CAM::PDF;
   use CAM::PDF::PageText;

   my $pdf = CAM::PDF->new($filename);                # $filename: path to the PDF
   my $pageone_tree = $pdf->getPageContentTree(1);    # content tree of page 1
   print CAM::PDF::PageText->render($pageone_tree);

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.

And answered 16/7, 2009 at 14:22 Comment(1)
I'm the CAM::PDF author and I agree with the disclaimers. I built the text extraction on a whim and it turned out to be a lot harder than I anticipated.Iotacism
C
7

I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.

pdftotext usually recognises non-ASCII characters fine. Is it possible that it's extracting them OK, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
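
One way to test the encoding theory from Perl is to ask pdftotext for UTF-8 explicitly and decode its output as UTF-8 on the way in. This is only a rough sketch: it assumes pdftotext is on your PATH and that your build accepts the -enc option and "-" for stdout. read_utf8_from_command is a name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a command (list form, so the arguments are not
# pasted into one command string) and return its stdout decoded from UTF-8.
sub read_utf8_from_command {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    binmode($fh, ':encoding(UTF-8)');   # decode bytes into characters
    local $/;                           # slurp mode: read everything at once
    my $text = <$fh>;
    close($fh) or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";
    return defined $text ? $text : '';
}

if (@ARGV) {
    my $pdf = shift @ARGV;              # PDF path given on the command line
    # -enc UTF-8 requests UTF-8 output; the trailing '-' sends text to stdout
    my $text = read_utf8_from_command('pdftotext', '-enc', 'UTF-8', $pdf, '-');
    binmode(STDOUT, ':encoding(UTF-8)');
    print $text;
}
```

If α and β show up here but not in your viewer, the viewer's encoding setting is the likelier culprit. Note that list-form pipe opens on Windows need a reasonably recent Perl.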

Collier answered 16/7, 2009 at 12:6 Comment(0)
L
5

You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).

Leverick answered 16/7, 2009 at 13:45 Comment(2)
It's worse than this: text need not be laid out on the page in reading order, and it need not be laid out rectilinearly. Writing a simple find-word command for Acrobat 1.0 took me 5 months, and that's with the people who created all the support libraries and designed the format in adjacent offices. Extracting text is a subset of that problem.Irregularity
Letters not being represented by character codes, but instead by bitmaps or vector graphics, is really pathological these days. Text not being laid out in reading order is kind of normal, but usually the results are intelligible.Eldwen
D
3

There is getpdftext.pl, which is part of CAM::PDF.

Dalmatia answered 16/7, 2009 at 13:36 Comment(1)
@Chris Dolan It is not that bad either ;-)Encephalography
L
2

Well, I tried two or three Perl modules, such as CAM::PDF and PDF::API2, but the problem remains the same. I'm parsing a PDF file containing main pages. CAM::PDF and PDF::API2 parse the plain text very well. However, they are not able to parse the code snippets (code snippets are usually in a different font and encoding than the plain text).

Lavallee answered 20/5, 2011 at 6:25 Comment(0)
H
1

James Healy is correct. After trying CAM::PDF and PDF::API2, with the former of which I've had some success reading text, I downloaded pdftotext and it worked great for a number of my implementations.

If you are on Windows, go here and download the precompiled xpdf binary: http://www.foolabs.com/xpdf/download.html

Then, if you need to run this from within Perl, use system, e.g.: system('C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe', $saveName);

where $saveName is the full path to your PDF file.

This hopefully leaves you with a text file you can open and parse in perl.
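
Here is a rough sketch of the whole round trip: run pdftotext, check that it succeeded, then read the text file back. The helper names text_path_for and extract_to_file are invented for this example, and it assumes your pdftotext build accepts -enc UTF-8 and an explicit output file as its second argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: derive the output path by swapping a trailing .pdf for .txt.
sub text_path_for {
    my ($pdf_path) = @_;
    (my $txt_path = $pdf_path) =~ s/\.pdf\z/.txt/i
        or die "Expected a path ending in .pdf, got '$pdf_path'";
    return $txt_path;
}

# Hypothetical helper: run pdftotext and return the path of the text file.
sub extract_to_file {
    my ($pdftotext, $pdf_path) = @_;
    my $txt_path = text_path_for($pdf_path);
    # List form of system: each argument is passed separately rather than
    # being interpolated into one command string.
    my $status = system($pdftotext, '-enc', 'UTF-8', $pdf_path, $txt_path);
    die "Could not start $pdftotext: $!" if $status == -1;
    die "$pdftotext exited with status " . ($status >> 8) if $status != 0;
    return $txt_path;
}

if (@ARGV == 2) {
    my ($pdftotext, $pdf_path) = @ARGV;   # path to pdftotext(.exe) and to a PDF
    my $txt_path = extract_to_file($pdftotext, $pdf_path);
    open(my $in, '<:encoding(UTF-8)', $txt_path)
        or die "Cannot open $txt_path: $!";
    binmode(STDOUT, ':encoding(UTF-8)');
    while (my $line = <$in>) {
        print $line;                      # replace with your own parsing
    }
    close($in);
}
```

Checking the return value of system is worth the extra lines, since otherwise a failed extraction only shows up later as a missing or stale text file.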

Haematozoon answered 22/2, 2015 at 21:14 Comment(0)
F
0

I tried this module, which works fine for special characters in PDFs:

#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;

my $filename = "pdf.pdf";

my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print "$text";
Fullmer answered 12/5, 2016 at 5:33 Comment(0)
C
0

I experimented on different PDF files with

PDF::API2
CAM::PDF
CAM::PDF::PageText

and they are all unreliable. The best way I found to extract text from PDF files is Poppler's long-standing pdftotext command-line utility. You can run

pdftotext ~/your_pdf.pdf - 

and then read stdout from Perl and parse it.

The - at the end tells pdftotext to write the content of the PDF file to stdout.

I found pdftotext reliable, capable of reading text from all PDFs I had to test.
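
Reading that stdout from Perl could look roughly like the sketch below. It assumes pdftotext is on your PATH and emits UTF-8, and it relies on pdftotext ending each page with a form feed unless you pass -nopgbrk. split_pages is just a name chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split pdftotext output into pages on the form feed
# ("\f") that ends each page. Trailing empty pages are dropped by split.
sub split_pages {
    my ($text) = @_;
    my @pages = split /\f/, $text;
    return @pages;
}

if (@ARGV) {
    my $pdf = shift @ARGV;                # PDF path given on the command line
    open(my $fh, '-|', 'pdftotext', $pdf, '-')
        or die "Cannot run pdftotext: $!";
    binmode($fh, ':encoding(UTF-8)');
    my $text = do { local $/; <$fh> };    # slurp all of stdout
    close($fh) or die "pdftotext failed (exit status " . ($? >> 8) . ")";

    my @pages = split_pages(defined $text ? $text : '');
    for my $i (0 .. $#pages) {
        # Count non-blank lines as a stand-in for real parsing
        my @lines = grep { /\S/ } split /\n/, $pages[$i];
        printf "page %d: %d non-blank lines\n", $i + 1, scalar @lines;
    }
}
```

Splitting by page first keeps any later parsing from running text together across page boundaries.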

Carencarena answered 24/9, 2023 at 13:25 Comment(0)
F
0

I had to fiddle a bit with 3 bugs

PDF::Burst "defined %dat" -> %dat

LEOCHARRE::CLI2 "defined %ARGV" -> %ARGV

PDF::OCR2::Base "use LEOCHARRE::Debug" -> "use LEOCHARRE::DEBUG"

But then the following guide works.

sudo apt install imagemagick-6.q16
sudo apt install poppler-utils
eval "$(perl -Mlocal::lib)"
cpanm PDF::OCR2

# full pdf2text
perl -MPDF::OCR2 -E '$PDF::OCR2::CHECK_PDF=0;say PDF::OCR2->new(+shift)->text' ./example.pdf
# just a single page
perl -MPDF::OCR2 -E '$PDF::OCR2::CHECK_PDF=0;say PDF::OCR2->new(+shift)->page(1)->text' ./example.pdf

If the resolution isn't high enough, you may want to upscale the pages with ImageMagick first, for example:

convert small.pdf -resize 200% big.pdf

There are probably plenty of other tricks to improve accuracy; search for "tesseract improve accuracy".

Fanaticize answered 12/4 at 15:39 Comment(0)
B
-2

Take a look at PDFBox. It is a library, but I think it also comes with a tool for text extraction.

Boss answered 16/7, 2009 at 13:42 Comment(1)
Does it support Perl?Beck