Not able to read the exact text highlighted across the lines

Asked 16/9, 2015 at 12:3 Answered 17/3, 2021 at 23:40

I am trying to read highlighted text from a PDF document using PDFBox. I can read highlighted text that sits on a single line, whether it is one word or several. However, I cannot read highlighted text that spans multiple lines. The following sample code shows how I read the highlighted text.

PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
            int pageNum = i + 1;
            PDPage page = (PDPage) allPages.get(i);
            List<PDAnnotation> la = page.getAnnotations();
            if (la.size() < 1) {
                continue;
            }
            System.out.println("Page number : "+pageNum);
            for (PDAnnotation pdfAnnot: la) {
                if (pdfAnnot.getSubtype().equals("Popup")) {
                    continue;
                }

                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);

                PDRectangle rect = pdfAnnot.getRectangle();
                float x = rect.getLowerLeftX() - 1;
                float y = rect.getUpperRightY() - 1;
                float width = rect.getWidth();
                float height = rect.getHeight() + rect.getHeight() / 4;

                int rotation = page.findRotation();
                if (rotation == 0) {
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                }

                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion(Integer.toString(0), awtRect);
                stripper.extractRegions(page);
                System.out.println("------------------------------------------------------------------");
                System.out.println("Annot type = " + pdfAnnot.getSubtype());
                System.out.println("Getting text from region = " + stripper.getTextForRegion(Integer.toString(0)) + "\n");
                System.out.println("Getting text from comment = " + pdfAnnot.getContents());

            }
        }

When the highlight spans multiple lines, the "pdfAnnot.getRectangle()" function returns the smallest rectangle that encloses all of the highlighted text, so extracting that region yields more text than was highlighted. I could not find any API that extracts exactly the highlighted text.

For example, here is the text extracted from a test PDF file:

Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat

Reader. Recipients of other file formats sometimes can't open files because they

don't have the applications used to create the documents.

Use case 1: Reading the first highlighted text, i.e. "PDF". There is no issue reading text highlighted within a single line. The correct text is printed:
Output: Getting text from region = "PDF"

Use case 2: Reading the second highlighted text, i.e. "Adobe Acrobat Reader", which spans two lines. In this case, the text extracted by the above program is:
Output: Getting text from region = "Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat Reader. Recipients of other file formats sometimes can't open files because they".

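The getRectangle() API returns the coordinates of the smallest rectangle that encloses the highlighted text. Because the highlight runs from the end of one line to the start of the next, that rectangle covers both lines in full, so the result contains much more than "Adobe Acrobat Reader".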
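To illustrate why the enclosing rectangle is so wide, here is a minimal sketch in plain Java (no PDFBox involved). The class name and the coordinates are made up for illustration: one box stands for a highlight at the end of the first line, the other for a highlight at the start of the second line, in a top-left-origin coordinate system.

```java
import java.awt.geom.Rectangle2D;

class BoundingBoxDemo {
    // Returns the smallest rectangle that encloses all of the given boxes,
    // which is what a single annotation rectangle amounts to.
    static Rectangle2D enclosing(Rectangle2D... boxes) {
        Rectangle2D result = boxes[0];
        for (Rectangle2D box : boxes) {
            result = result.createUnion(box);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-line highlight boxes (x, y, width, height).
        Rectangle2D endOfLine1 = new Rectangle2D.Float(420f, 100f, 120f, 14f);
        Rectangle2D startOfLine2 = new Rectangle2D.Float(72f, 116f, 40f, 14f);

        Rectangle2D union = enclosing(endOfLine1, startOfLine2);
        System.out.println("Enclosing rectangle: " + union);
    }
}
```

With these assumed numbers the enclosing rectangle runs from x = 72 to x = 540 and is 30 units tall, so it spans the full width of both lines even though only a small part of each line is highlighted.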

How can I find the start and end points of the highlight within the extracted region?
How can I find the number of lines in the extracted region?

Any help will be highly appreciated.

Arlinda answered 16/9, 2015 at 12:3 Comment(5)

"pdfAnnot.getRectangle()" function returns the minimum rectangle area around the text. - then why don't you use its coordinates as is? That being said, can you share a sample PDF to make your issue easier to reproduce? – Antheridium 16/9, 2015 at 12:49

Can you share your sample PDF to make your issue easier to reproduce? Or have you already done so and I keep overlooking the link? – Antheridium 18/9, 2015 at 6:19

I think, though, that you need to look at the QuadPoints instead of the Rect (angle). – Antheridium 18/9, 2015 at 8:1

Thanks. I read the multi-line highlighted text using QuadPoints. – Arlinda 22/9, 2015 at 14:16

Possible duplicate of Java: Apache PDFbox Extract highlighted text – Antheridium 13/8, 2016 at 13:20

I managed to extract the highlighted text by using the following code.

// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // this is the in-memory representation of the PDF document.
    // this will load a document from a file.
    PDDocument document = PDDocument.load(filePath);
    // this represents all pages in a PDF document.
    List<PDPage> allPages =  document.getDocumentCatalog().getAllPages();
    // this represents a single page in a PDF document.
    PDPage page = allPages.get(pageNumber);
    // get  annotation dictionaries
    List<PDAnnotation> annotations = page.getAnnotations();

    for(int i=0; i<annotations.size(); i++) {
        // check subType 
        if(annotations.get(i).getSubtype().equals("Highlight")) {
            // extract highlighted text
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {

                // COSNumber covers both COSFloat and COSInteger entries;
                // casting to COSFloat fails when a coordinate is stored as an integer.
                COSNumber ULX = (COSNumber) quadsArray.get(0+k);
                COSNumber ULY = (COSNumber) quadsArray.get(1+k);
                COSNumber URX = (COSNumber) quadsArray.get(2+k);
                COSNumber URY = (COSNumber) quadsArray.get(3+k);
                COSNumber LLX = (COSNumber) quadsArray.get(4+k);
                COSNumber LLY = (COSNumber) quadsArray.get(5+k);
                COSNumber LRX = (COSNumber) quadsArray.get(6+k);
                COSNumber LRY = (COSNumber) quadsArray.get(7+k);

                k+=8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // calculated by upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}
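The coordinate handling in the loop above can also be sketched on its own, without PDFBox. This is only a sketch under stated assumptions: the class and method names are hypothetical, the QuadPoints numbers are assumed to have already been copied into a float[] (for instance via COSArray.toFloatArray(), if your PDFBox version provides it), and the page height is assumed to be the media box height. Unlike the code above, it takes the minimum and maximum of the four corners, so it does not depend on the order in which a PDF producer writes them, and it applies no 1-unit padding.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class QuadPointRegions {
    // Converts a flat QuadPoints array (8 numbers per quadrilateral, PDF user
    // space with the origin at the bottom-left) into one rectangle per
    // quadrilateral, expressed with the origin at the top-left.
    static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int i = 0; i + 8 <= quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Each quadrilateral is four (x, y) pairs; the corner order is not assumed.
            for (int p = 0; p < 8; p += 2) {
                minX = Math.min(minX, quadPoints[i + p]);
                maxX = Math.max(maxX, quadPoints[i + p]);
                minY = Math.min(minY, quadPoints[i + p + 1]);
                maxY = Math.max(maxY, quadPoints[i + p + 1]);
            }
            // Flip the y axis: the top edge in PDF space becomes the rectangle's y.
            regions.add(new Rectangle2D.Float(
                    minX, pageHeight - maxY, maxX - minX, maxY - minY));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Hypothetical two-line highlight on a page that is 792 units tall.
        float[] quads = {
            420f, 692f, 540f, 692f, 420f, 678f, 540f, 678f,
            72f, 676f, 112f, 676f, 72f, 662f, 112f, 662f
        };
        for (Rectangle2D.Float region : toRegions(quads, 792f)) {
            System.out.println(region);
        }
    }
}
```

Each returned rectangle could then be registered as its own region on a PDFTextStripperByArea, one per highlighted line.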

Presto answered 13/8, 2016 at 5:11 Comment(1)

Please don't post duplicate answers, instead answer one and mark the others as duplicates (as soon as your reputation allows you to). – Antheridium 13/8, 2016 at 13:24

To make the code provided by @roham-amini work with the current version of Apache PDFBox (2.0), you have to make a lot of changes.

The code below worked fine for me; I used it in a Groovy script in Freeplane. You may need to replace the logger.info call.

@Grab(group='org.apache.pdfbox', module='pdfbox', version='2.0.22')
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.*;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationMarkup;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationText;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import org.apache.pdfbox.pdmodel.common.*;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import java.awt.geom.Rectangle2D;
import org.apache.pdfbox.cos.*




// PDDocument document = new PDDocument();
String pdfFilePath = 'temp.pdf'
PDDocument pdfDoc = PDDocument.load(new File(pdfFilePath));
ArrayList<String> highlightedTexts = new ArrayList<>();

int pageNum=0;
for( PDPage pdfpage : pdfDoc.getPages() )
{
    pageNum++;
    List<PDAnnotation> annotations = pdfpage.getAnnotations();
    //first setup text extraction regions
    for( int i=0; i<annotations.size(); i++ )
    {
        PDAnnotation annot = annotations.get(i);
        annotNote = annot.getContents(); // content written in the note
        annotSubType = annot.getSubtype() // note type (Highlight, Text)
        // annotTitle = annot.getTitlePopup(); // note author
        if( annotSubType.equals('Highlight') )
        {
        // extract highlighted text
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            COSArray quadsArray = (COSArray) annot.getCOSObject().getCOSArray(COSName.getPDFName("QuadPoints"));
            String str = null;
            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {
                Float ULX = quadsArray.get(0+k).floatValue();
                Float ULY = quadsArray.get(1+k).floatValue();
                Float URX = quadsArray.get(2+k).floatValue();
                Float URY = quadsArray.get(3+k).floatValue();
                Float LLX = quadsArray.get(4+k).floatValue();
                Float LLY = quadsArray.get(5+k).floatValue();
                Float LRX = quadsArray.get(6+k).floatValue();
                Float LRY = quadsArray.get(7+k).floatValue();
                k+=8;
                float ulx = ULX - 1; // upper left x.
                float uly = ULY; // upper left y.
                float width = URX - LLX;          // calculated by upperRightX - lowerLeftX.
                float height = URY - LLY;         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = pdfpage.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripper.addRegion("highlightedRegion", rectangle_2);
                stripper.extractRegions(pdfpage);
                String highlightedText = stripper.getTextForRegion("highlightedRegion").replaceAll("[\\n\\t ]", " ");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
            logInfo = str;

            logMsg=">>>>>>>>>>Page: " + pageNum + ", Section: " + annotNote + ", Note: " + annotNote + ", Highlighted text: " + logInfo;
            logger.info(logMsg);
        }
    }

}
pdfDoc.close();
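The per-line fragments returned by the stripper usually end in line breaks, so plain concatenation can leave stray newlines or run words together. As a sketch of one way to tidy this up, the hypothetical helper below (plain Java, which should also be usable from a Groovy script) collapses whitespace in each fragment and joins the fragments with single spaces. It is an alternative to the replaceAll/concat logic above, not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;

class HighlightTextJoiner {
    // Joins the text of the individual highlight regions into one string,
    // collapsing runs of whitespace and skipping null or empty fragments.
    static String join(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            if (fragment == null) {
                continue;
            }
            String cleaned = fragment.replaceAll("\\s+", " ").trim();
            if (cleaned.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(cleaned);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments as they might come back for a two-line highlight.
        List<String> fragments = Arrays.asList("Adobe Acrobat\n", "Reader\r\n");
        System.out.println(join(fragments));
    }
}
```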

Councilor answered 17/3, 2021 at 23:40 Comment(0)
