PDFBox adding white spaces within words

Asked 31/10, 2011 at 14:6 Answered 31/1, 2013 at 8:29

When I try to extract text from my PDF files, PDFBox seems to insert white spaces inside several words at random.

I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the Downloads section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training

I've tried several other PDF files and it does the same on several pages.

I do the following:

java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/"ped training pdf.pdf"

on the downloaded file, and the console output shows spaces wrongly inserted inside words, for example:

"• If ch ildren are able to walk to schoo l safely this could reduce the congestion. "

"• Develops good hab its for later life."

"www.sheff ield.gov.uk"

"Think Ahead!, wh ich is based on the"

etc etc.

As you can see, several of the words above contain spaces for no reason I can fathom.

I am on Ubuntu, running Sun's JDK 1.6.

I've tried this on several different PDF files and searched the forums for a solution. There were similar bugs, but all of them seem to have been resolved.

If anyone can help, or has the same problem, please comment. This is causing a big problem in indexing the content properly for search.

Periclean answered 31/10, 2011 at 14:6 Comment(0)

Unfortunately there is currently no easy solution for this.

Internally PDF documents simply contain instructions like "place characters 'abc' in position X" and "place characters 'def' in position Y", and PDFBox tries to reason whether the resulting extracted text should be "abc def" or "abcdef" based on things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see they don't always produce the correct result.

One way to improve the quality of the extracted text is to try a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, then it's fairly likely that the text extractor has mistakenly added an extra space inside the word. Unfortunately such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
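The dictionary lookup idea above can be sketched as a post-processing step on the extracted text. This is only a minimal illustration, not something PDFBox provides: the class name SpaceRepair, the method repair, and the tiny word list are all made up for the example, and a real dictionary (for instance one built from your index terms) would be needed in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: rejoins tokens that were
// probably split by a spurious space, using a dictionary lookup.
public class SpaceRepair {

    public static String repair(String text, Set<String> dictionary) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String current = tokens[i];
            // Only try a merge when the token on its own is not a known word.
            if (i + 1 < tokens.length && !isKnown(current, dictionary)) {
                String merged = current + tokens[i + 1];
                if (isKnown(merged, dictionary)) {
                    out.add(merged);
                    i += 2; // the next token was consumed by the merge
                    continue;
                }
            }
            out.add(current);
            i++;
        }
        return String.join(" ", out);
    }

    // Strips surrounding punctuation and lower-cases before the lookup.
    // Tokens without any letters (bullets, numbers) are treated as known
    // so that they are never glued onto the following word.
    private static boolean isKnown(String token, Set<String> dictionary) {
        String normalized = token
                .replaceAll("^\\P{L}+|\\P{L}+$", "")
                .toLowerCase(Locale.ROOT);
        return normalized.isEmpty() || dictionary.contains(normalized);
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; replace with a real word list.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "develops", "good", "habits", "for", "later", "life"));
        System.out.println(repair("Develops good hab its for later life.", dictionary));
    }
}
```

Note that this only merges when the first fragment is not itself a word, so a split such as "for get" would be left alone. Names, URLs, and words missing from the dictionary are not repaired either.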

Throw answered 31/10, 2011 at 16:58 Comment(2)

Thanks Jukka, sometimes it's a relief just to understand why something is not working as expected, and to know that I am not doing anything to cause the problem. – Periclean 1/11, 2011 at 9:9

Here is an example of how to build such a term dictionary if you are using Lucene: "How to extract a Document Term Vector in Lucene". – Anabal 17/1, 2012 at 21:26

The class org.apache.pdfbox.util.PDFTextStripper (pdfbox-1.7.1) lets you adjust how readily it decides that two strings belong to the same word.

Increasing spacingTolerance will reduce the number of inserted spaces.

/**
 * Set the space width-based tolerance value that is used
 * to estimate where spaces in text should be added.  Note that the
 * default value for this has been determined from trial and error.
 * Setting this value larger will reduce the number of spaces added. 
 * 
 * @param spacingToleranceValue tolerance / scaling factor to use
 */
public void setSpacingTolerance(float spacingToleranceValue) {
    this.spacingTolerance = spacingToleranceValue;
}
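
To see why a larger tolerance yields fewer spaces, here is a simplified model of a space-width-based decision. It is not PDFBox's actual implementation: the class GapHeuristic, the Chunk type, and the positions and space width used below are invented for illustration, and the real stripper takes more factors into account.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of gap-based space insertion.
public class GapHeuristic {

    // A run of characters with its horizontal start and end position.
    public static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        public Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Inserts a space between two chunks only when the gap between them
    // exceeds spaceWidth * tolerance. A larger tolerance raises that
    // threshold, so fewer gaps are turned into spaces.
    public static String join(List<Chunk> chunks, float spaceWidth, float tolerance) {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : chunks) {
            if (previous != null) {
                float gap = chunk.startX - previous.endX;
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');
                }
            }
            sb.append(chunk.text);
            previous = chunk;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: a small gap inside "habits", a real gap before "for".
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("hab", 0f, 30f));
        line.add(new Chunk("its", 32f, 60f));
        line.add(new Chunk("for", 66f, 90f));

        System.out.println(join(line, 5f, 0.3f)); // threshold 1.5
        System.out.println(join(line, 5f, 0.5f)); // threshold 2.5
        System.out.println(join(line, 5f, 1.5f)); // threshold 7.5
    }
}
```

In this model the lowest tolerance splits "hab its", the middle one gives "habits for", and the highest one also swallows the genuine space. The same trade-off applies when tuning setSpacingTolerance, so it is worth testing a few values against your own documents.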

Beebread answered 31/1, 2013 at 8:29 Comment(0)
