PDFBox retrieve text from overlapping boxes

Asked 12/10, 2017 at 21:57 Answered 12/10, 2017 at 21:57

I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, in some of the PDFs I am scraping, the position of the text shifts slightly from page to page, so a fixed rectangle does not always capture it. I'm looking for help with how to deal with this.

In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.
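To make the intent concrete, here is a rough sketch of the selection I have in mind. It assumes the bounds of each text box are already available as rectangles, which is exactly the part I don't know how to obtain from PDFBox. The class and method names (OverlapRegion, overlapping, expandToBoxes) are made up for illustration and are not part of PDFBox; only java.awt.geom.Rectangle2D is real, and it is the type that PDFTextStripperByArea.addRegion accepts.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class: it only does the geometry.
public class OverlapRegion {

    // Returns every text box whose interior overlaps the target region.
    static List<Rectangle2D> overlapping(Rectangle2D region, List<Rectangle2D> boxes) {
        List<Rectangle2D> hits = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            if (region.intersects(box)) {
                hits.add(box);
            }
        }
        return hits;
    }

    // Grows the region so that it fully contains every box it overlaps.
    // Boxes are tested against the ORIGINAL region only, so the growth
    // does not cascade to boxes that merely touch the enlarged area.
    static Rectangle2D expandToBoxes(Rectangle2D region, List<Rectangle2D> boxes) {
        Rectangle2D expanded = region;
        for (Rectangle2D box : overlapping(region, boxes)) {
            expanded = expanded.createUnion(box);
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Assumed text box bounds; in practice these would have to come
        // from the PDF somehow, which is the open question here.
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(90, 95, 100, 30));
        boxes.add(new Rectangle2D.Double(300, 300, 50, 20));
        boxes.add(new Rectangle2D.Double(140, 110, 80, 40));

        Rectangle2D region = new Rectangle2D.Double(100, 100, 50, 20);
        System.out.println(expandToBoxes(region, boxes));
    }
}
```

The expanded rectangle could then be passed to addRegion in place of the fixed one, but that only helps if the text box bounds can be found in the first place.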

Is there a way to do this?

Zenaidazenana answered 12/10, 2017 at 21:57 Comment(3)

Please share an example PDF. It is not entirely clear what these "text boxes" are in PDF syntax. They might be all the text drawn as part of a single text object, or they might be all the text drawn inside a rectangle path, or something else entirely. – Catharinecatharsis 13/10, 2017 at 4:29

@Catharinecatharsis The grey boxes are just what comes up in Acrobat when I use Edit mode. I can't see any concept that matches in PDFBox (I thought maybe beads or articles, but I think not). The documents contain sensitive data, so I can't share them here. I'll see if I can find something less sensitive with the same type of content. – Zenaidazenana 14/10, 2017 at 20:57

@Catharinecatharsis Please see gist.github.com/beldaz/8d658c7ae8d9cb9402ca61f4256c4319 where the text in the bottom right of the page is editable in Acrobat as 7 distinct text boxes. I generated this by replacing the text of an existing PDF, not from scratch. – Zenaidazenana 14/10, 2017 at 21:30
