iTextSharp - How to get the position of word on a page

I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?

Kenzie answered 3/3, 2010 at 23:0 Comment(1)
Did you find a good solution to your problem? – Solley

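Yes, there is. Check out the text.pdf.parser package (com.itextpdf.text.pdf.parser in iText, iTextSharp.text.pdf.parser in iTextSharp), specifically LocationTextExtractionStrategy. Actually, that might not do the trick by itself, since what it hands back is the extracted text as a string rather than positions. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor. The snippets here use the Java iText API; iTextSharp has the same classes with C#-style capitalized method names.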

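import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

public class MyTexExStrat implements TextExtractionStrategy {
    // Interface methods must be declared public in the implementing class.
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
      // track text and location here.
    }
    // Also required by TextExtractionStrategy; return the collected text here.
    public String getResultantText() { return ""; }
}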

You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
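
To get from "pieces of text with boxes" to "words with boxes", here is a minimal sketch of the grouping step. It is plain Java with no iText dependency: TextBox and WordGrouper are hypothetical names, and it assumes you have already recorded one box per character in renderText (in iText 5, TextRenderInfo.getCharacterRenderInfos() should give you per-character infos, but check your version). The maxGap threshold is also an assumption you would tune for your documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one piece of text recorded in renderText(): its string
// and the corners of its box (lower-left x/y, upper-right x/y) in page units.
class TextBox {
    final String text;
    final float llx, lly, urx, ury;

    TextBox(String text, float llx, float lly, float urx, float ury) {
        this.text = text;
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }
}

class WordGrouper {
    // Merges single-character boxes, given in reading order, into word boxes.
    // A word ends at a whitespace character, at a horizontal gap wider than
    // maxGap, or when the bottom edge (lly) moves by more than maxGap.
    static List<TextBox> groupIntoWords(List<TextBox> chars, float maxGap) {
        List<TextBox> words = new ArrayList<>();
        TextBox current = null;
        for (TextBox c : chars) {
            if (c.text.trim().isEmpty()) {          // whitespace closes the word
                if (current != null) words.add(current);
                current = null;
                continue;
            }
            boolean startsNewWord = current != null
                    && (c.llx - current.urx > maxGap
                        || Math.abs(c.lly - current.lly) > maxGap);
            if (startsNewWord) {
                words.add(current);
                current = null;
            }
            if (current == null) {
                current = c;
            } else {
                // Extend the running word: append the text and grow the box.
                current = new TextBox(current.text + c.text,
                        Math.min(current.llx, c.llx), Math.min(current.lly, c.lly),
                        Math.max(current.urx, c.urx), Math.max(current.ury, c.ury));
            }
        }
        if (current != null) words.add(current);
        return words;
    }
}
```

Each returned TextBox carries the word and the four numbers you would pass to a Rectangle. Like the rest of this answer, it only makes sense for horizontal, left-to-right text.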

PS: to build the rects, you can just get the AscentLine & DescentLine and use their coordinates as the bottom-left and top-right corners:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

Warning: The above code assumes that the text is horizontal and proceeds from left to right. Rotated text will break it, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications the above should be fine, but know its limits.
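
If rotated text does matter to you, one hedged workaround is to take all four end points (start and end of both the ascent line and the descent line) and build the smallest upright box around them. The sketch below is plain Java with no iText dependency; BoundingBox.enclose is a hypothetical helper, and you would feed it the x/y values you read from the Vectors yourself.

```java
class BoundingBox {
    // Returns {llx, lly, urx, ury}: the smallest axis-aligned box that contains
    // every given point. Each point is a two-element array {x, y}.
    static float[] enclose(float[][] points) {
        float minX = Float.POSITIVE_INFINITY, minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
        for (float[] p : points) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new float[] {minX, minY, maxX, maxY};
    }
}
```

For example, take made-up coordinates for text running straight up the page: a descent line from (100, 50) to (100, 80) and an ascent line from (90, 50) to (90, 80). The two-corner version would pass llx = 100 and urx = 90, whereas enclosing all four points gives the box 90, 50, 100, 80. Note that this is still an upright box around the text, not a rotated rectangle.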

Good hunting.

Portulaca answered 1/2, 2011 at 17:50 Comment(2)
Note: The LocationTextExtractionStrategy parser does not necessarily locate text in the order it appears in the document. I have been putting text into footers (.docx files) and then converting them to PDF (with DOCX4J). I've found that the parser finds the text from what was the .docx file's footer first, then the text in the body section, i.e. it locates the text at the bottom of the page, then the text above it. If you need results in order of appearance, you may need to sort them yourself. – Cliffhanger
Check out this link for the C# version: #23910393 – Nones
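
On the ordering point raised in the comments: a rough sketch of sorting collected results into reading order, assuming horizontal left-to-right text and the usual PDF coordinate system with the origin at the bottom-left (so larger y means higher on the page). PlacedText and ReadingOrder are hypothetical names, and bucketing by lineHeight is a simplification: two items whose y values fall on either side of a bucket boundary will be treated as different lines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder: a string plus the lower-left corner of its box.
class PlacedText {
    final String text;
    final float x, y;

    PlacedText(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class ReadingOrder {
    // Returns a copy sorted top-to-bottom, then left-to-right. Items whose y
    // rounds to the same multiple of lineHeight count as being on one line.
    static List<PlacedText> sort(List<PlacedText> items, float lineHeight) {
        List<PlacedText> sorted = new ArrayList<>(items);
        sorted.sort(Comparator
                .comparingInt((PlacedText p) -> -Math.round(p.y / lineHeight))
                .thenComparingDouble(p -> p.x));
        return sorted;
    }
}
```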
