Apache PDFBox Remove Spaces between characters
We are using PDFBox to extract text from PDFs.

The text of some PDFs can't be extracted correctly. The following image shows part of such a PDF:

enter image description here

After text extraction we get the following text:
3, 8 5 EU R 1 Netto 38,50 EUR 4,00
(Spaces are added between ',' and '8')

Here is our code:

PDDocument pdf = PDDocument.load(reuseableInputStream);
PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setSortByPosition(true);
String text = pdfStripper.getText(pdf);

We tried adjusting the PDFTextStripper attributes 'AverageCharTolerance' and 'SpacingTolerance', with no positive effect.

The alternative library 'iText' extracts the text correctly, without spaces between the characters. But we can't use it because of license problems.

Any ideas? Thank you.

EDIT: We are using version 1.8.9. We also tried the 2.0.0 snapshot version, with no effect.

Camel answered 10/4, 2015 at 6:1 Comment(5)
Can you share a sample PDF? With that we can see whether there actually are space characters in the file (even though they might not show). - Unprintable
These documents are customer documents, sorry. I am forbidden to share them :/ - Camel
forbidden to share this document - I'm afraid in that case there is nothing to work on here. - Unprintable
I'm now able to share a sample PDF. Please contact me via e-mail at [email protected]. I'll send it by e-mail. - Camel
You can find an e-mail address for me in my profile here, simply click on mkl. - Unprintable

The cause

Inspecting the file provided by the OP, it turns out that the issue is caused by extra spaces actually being there! There are multiple strings drawn from the same starting position; at every position at most one of those strings has a non-space character. Thus, the PDF viewer output looks good, but PDFBox, as a text extractor, tries to make use of all characters found, including those extra space characters.
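
To make the effect concrete, here is a minimal sketch in plain Java (no PDFBox involved) that models two strings drawn from the same starting position in a monospaced font. The class name `OverlayDemo`, the method `overlay` and the sample strings are made up for illustration. Overlaying the columns the way a viewer paints them gives clean-looking text, while each individual string still carries all the padding spaces that a text extractor also gets to see.

```java
public class OverlayDemo {
    // Merge two monospaced strings column by column, the way they appear
    // on screen: a space paints nothing, so whichever string has a
    // non-space glyph in a column determines what the reader sees there.
    public static String overlay(String first, String second) {
        int length = Math.max(first.length(), second.length());
        StringBuilder visible = new StringBuilder();
        for (int i = 0; i < length; i++) {
            char a = i < first.length() ? first.charAt(i) : ' ';
            char b = i < second.length() ? second.charAt(i) : ' ';
            visible.append(a != ' ' ? a : b);
        }
        return visible.toString();
    }

    public static void main(String[] args) {
        // Hypothetical strings: each column holds at most one non-space glyph.
        String first  = "2,     EUR";
        String second = "  50      ";
        System.out.println("[" + overlay(first, second) + "]");
    }
}
```

The viewer effectively shows the overlay, but the padding spaces of `second` sit at the same positions as the glyphs of `first`, which is what trips up a character-by-character extractor.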

The behavior can be reproduced using a PDF with this content stream with F0 being Courier:

BT
/F0 9 Tf
100 500 Td
(             2                                                                  Netto        5,00 EUR 3,00) Tj
0 0 Td
(                2882892  ENERGIZE LR6 Industrial                     2,50 EUR 1) Tj
ET

In a PDF viewer this looks like this:

Screenshot

Copy & paste from Adobe Reader results in

2 2 8 8 2 8 9 2 E N E R G I Z E L R 6 I n d u s t r i a l 2 , 5 0 E U R 1 Netto 5,00 EUR 3,00

Regular extraction using PDFBox results in

             2    2 8 8 2 89 2    E N E RG  IZ  E  L R 6  I n du s t  ri  a l                      2 ,5  0  EU  R  1 Netto        5,00 EUR 3,00

Thus, PDFBox is not the only one to have problems here. The two outputs look different, but the extra spaces are a problem either way.

I would propose telling the producer of those PDFs that they are difficult to post-process, even for widely-used software like Adobe Reader.

A work-around

To extract something sensible from this, we have to somehow ignore the (actually existing!) extra spaces. As there is no way to know up front which spaces are needed and which are not, we simply remove all of them and hope PDFBox adds spaces where necessary:

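import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;  // package as of PDFBox 1.8.x
import org.apache.pdfbox.util.TextPosition;     // package as of PDFBox 1.8.x

String extractNoSpaces(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void processTextPosition(TextPosition text)
        {
            // Forward only glyphs that carry something other than whitespace.
            String character = text.getCharacter();
            if (character != null && character.trim().length() != 0)
                super.processTextPosition(text);
        }
    };
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}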

(ExtractWithoutExtraSpaces.java)

Using this method with the test document we get:

2 2882892 ENERGIZE LR6 Industrial 2,50 EUR 1 Netto 5,00 EUR 3,00
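
For PDFBox 2.x, where (as far as I know) `TextPosition.getCharacter()` was renamed to `getUnicode()` and the stripper classes moved to `org.apache.pdfbox.text`, the same filter can be built around a small predicate. The sketch below is an illustration under those assumptions, not code from the linked file: `SpaceFilter` and `hasVisibleText` are hypothetical names, and the wiring into `PDFTextStripper` is only shown as a comment. Unlike `String.trim()`, the predicate also treats the non-breaking space (U+00A0) as blank.

```java
public class SpaceFilter {
    // Returns true when the glyph text contains at least one code point
    // that is neither Java whitespace nor a Unicode space separator.
    public static boolean hasVisibleText(String unicode) {
        if (unicode == null) {
            return false;
        }
        return unicode.codePoints()
                .anyMatch(cp -> !Character.isWhitespace(cp) && !Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("8"));
        System.out.println(hasVisibleText(" "));
        System.out.println(hasVisibleText("\u00A0"));
    }
}

// Assumed wiring inside an anonymous PDFTextStripper subclass (PDFBox 2.x):
//
//     @Override
//     protected void processTextPosition(TextPosition text)
//     {
//         if (SpaceFilter.hasVisibleText(text.getUnicode()))
//             super.processTextPosition(text);
//     }
```

Keeping the check in a separate method makes it easy to test without a PDF at hand and to adjust if a producer pads with other invisible characters.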

Different text extractors

The alternative library 'iText' extracts the text correctly, without spaces between the characters

This is due to iText extracting text string by string, not character by character. This procedure has its own perils but in this case results in something more usable out-of-the-box.
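
As a rough illustration of why working string by string helps here (a simplified sketch, not iText's actual algorithm), each drawn string can be cleaned up on its own, so the padding spaces of one string never end up between the characters of another. `StringWiseDemo`, `normalize` and `extract` are hypothetical names, and the sample strings are shortened versions of the two strings in the content stream above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StringWiseDemo {
    // Collapse runs of whitespace inside one drawn string and trim its ends.
    public static String normalize(String drawn) {
        return drawn.trim().replaceAll("\\s+", " ");
    }

    // Treat every drawn string as a unit and join the cleaned results,
    // skipping strings that consist of padding only.
    public static String extract(List<String> drawnStrings) {
        return drawnStrings.stream()
                .map(StringWiseDemo::normalize)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<String> drawn = Arrays.asList(
                "   2              Netto   5,00 EUR 3,00",
                "      2882892  ENERGIZE LR6 Industrial   2,50 EUR 1");
        System.out.println(extract(drawn));
    }
}
```

Note that this sketch emits the strings in drawing order rather than in reading order, which is one example of the perils mentioned above.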

Unprintable answered 24/6, 2015 at 17:48 Comment(0)

With newer versions of PDFBox the workaround doesn't work. But you can address the space problem and achieve the same result simply by configuring your PDFTextStripper like this:

PDFTextStripper stripper = new PDFTextStripper();
stripper.setWordSeparator("");
Fact answered 11/6, 2021 at 16:56 Comment(0)
