I am working on reading highlighted text from a PDF document using PDFBox. I can read highlighted text that sits on a single line, whether it is one word or several. However, I cannot read highlighted text that spans multiple lines. Below is the sample code I use to read the highlighted text.
PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Page number : " + pageNum);
    for (PDAnnotation pdfAnnot : la) {
        if (pdfAnnot.getSubtype().equals("Popup")) {
            continue;
        }
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        PDRectangle rect = pdfAnnot.getRectangle();
        float x = rect.getLowerLeftX() - 1;
        float y = rect.getUpperRightY() - 1;
        float width = rect.getWidth();
        float height = rect.getHeight() + rect.getHeight() / 4;
        int rotation = page.findRotation();
        if (rotation == 0) {
            PDRectangle pageSize = page.getMediaBox();
            y = pageSize.getHeight() - y;
        }
        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
        stripper.addRegion(Integer.toString(0), awtRect);
        stripper.extractRegions(page);
        System.out.println("------------------------------------------------------------------");
        System.out.println("Annot type = " + pdfAnnot.getSubtype());
        System.out.println("Getting text from region = " + stripper.getTextForRegion(Integer.toString(0)) + "\n");
        System.out.println("Getting text from comment = " + pdfAnnot.getContents());
    }
}
When the highlighted text spans multiple lines, pdfAnnot.getRectangle() returns the smallest rectangle that encloses the whole highlight, so the extracted region contains more text than was actually highlighted. I could not find any API to extract exactly the highlighted text.
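To illustrate why a single enclosing rectangle over-selects, here is a minimal sketch using only java.awt.geom. The class name HighlightBounds, the helper boundingBox, and all coordinates are hypothetical, chosen for illustration and not measured from the sample PDF. It assumes a highlight that starts near the right end of one line and ends near the left margin of the next.

```java
import java.awt.geom.Rectangle2D;

// Illustration only: hypothetical coordinates, top-left origin, units in points.
final class HighlightBounds {

    /** Returns the smallest rectangle enclosing all of the given per-line rectangles. */
    static Rectangle2D boundingBox(Rectangle2D... lineRects) {
        Rectangle2D box = (Rectangle2D) lineRects[0].clone();
        for (Rectangle2D r : lineRects) {
            box = box.createUnion(r);
        }
        return box;
    }

    public static void main(String[] args) {
        // Part of the highlight on the first line, near the right margin.
        Rectangle2D line1 = new Rectangle2D.Float(400, 100, 150, 12);
        // Part of the highlight on the second line, near the left margin.
        Rectangle2D line2 = new Rectangle2D.Float(50, 114, 40, 12);

        // The union stretches from the left margin to the right margin,
        // so it covers both lines across their full width.
        System.out.println(boundingBox(line1, line2));
    }
}
```

With these numbers the two highlighted pieces are 150 and 40 points wide, but the enclosing rectangle is 500 points wide and two lines tall, which matches the behaviour described above.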
For example, take the following text from a test PDF file:
Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they
don't have the applications used to create the documents.
Use case 1:
Reading the first highlighted phrase, i.e. "PDF". There is no issue reading text highlighted on a single line. The correct text is printed, as shown below:
Output:
Getting text from region = "PDF"
Use case 2:
Reading the second highlighted phrase, i.e. "Adobe Acrobat Reader", which spans two lines. In this case, the text extracted by the above program is:
Output:
Getting text from region = "Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they".
getRectangle() returns the coordinates of the smallest rectangle enclosing the highlighted text, so the region contains more text than "Adobe Acrobat Reader".
- How can I find the start and end points of the highlight within the extracted region?
- How can I find the number of lines in the extracted region?
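One possible direction, offered as an untested sketch and not a confirmed solution: the PDF specification gives text markup annotations a QuadPoints array with 8 numbers per quadrilateral, and PDFBox appears to expose it through PDAnnotationTextMarkup.getQuadPoints(). Viewers typically write one quadrilateral per highlighted line, though that is not guaranteed. The helper below (the name QuadPointRegions and its method are hypothetical) turns such an array into one rectangle per quadrilateral with a top-left origin. It assumes an unrotated page whose media box starts at (0, 0).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: works on a plain float[] so it does not depend on PDFBox.
final class QuadPointRegions {

    /**
     * Converts a QuadPoints array (8 floats per quadrilateral, PDF user space,
     * origin at the bottom-left) into one rectangle per quadrilateral,
     * expressed with the origin at the top-left of the page.
     */
    static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int i = 0; i + 8 <= quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Use min/max over the four corners, since corner order varies between viewers.
            for (int j = 0; j < 8; j += 2) {
                minX = Math.min(minX, quadPoints[i + j]);
                maxX = Math.max(maxX, quadPoints[i + j]);
                minY = Math.min(minY, quadPoints[i + j + 1]);
                maxY = Math.max(maxY, quadPoints[i + j + 1]);
            }
            // Flip y: the top edge in PDF space (maxY) becomes the rectangle's y.
            regions.add(new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY));
        }
        return regions;
    }
}
```

Each returned rectangle could then be registered as its own region with stripper.addRegion(...) and the per-region texts concatenated in order. Under the one-quadrilateral-per-line assumption, the size of the list would also give the number of highlighted lines.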
Any help will be highly appreciated.