"black stain" when extracting page to image on PDFBox 2.0.4

When I use PDFBox 2.0.4 to render pages as images, the resulting page contains several "black holes", as shown in the following screenshot:

[Screenshot: rendered page with black blotches]

This happens only with this PDF and a few others: http://www.filedropper.com/selection_3

Here is a minimal JavaFX program that reproduces the problem (change the file path after downloading the PDF):

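import java.awt.image.BufferedImage;
import java.io.File;

import javafx.application.Application;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.BorderPane;
import javafx.stage.Stage;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFExtractionTest extends Application {

    @Override
    public void start(Stage primaryStage) throws Exception {
        BufferedImage bufferedImage;
        // try-with-resources closes the document once the page has been rendered
        try (PDDocument document = PDDocument.load(new File("C:\\Users\\John\\Desktop\\selection.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // Page indexes are zero-based, so index 1 is the second page
            bufferedImage = pdfRenderer.renderImage(1);
        }
        Image fxImage = SwingFXUtils.toFXImage(bufferedImage, null);

        BorderPane borderPane = new BorderPane();
        ImageView imageView = new ImageView(fxImage);
        borderPane.setCenter(imageView);

        primaryStage.setScene(new Scene(borderPane, 1024, 768));
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}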

Here are my dependencies:

  • pdfbox 2.0.4
  • jai-imageio-jpeg2000 1.3.0 (prevents the error "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed")
  • levigo-jbig2-imageio 1.6.5 (prevents the error "Cannot read JBIG2 image: jbig2-imageio is not installed")

The logs show the following warnings, but I don't know whether they are the cause of the problem. How can I fix it?

févr. 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
AVERTISSEMENT: No Unicode mapping for .notdef (9) in font Times-Bold
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
AVERTISSEMENT: No glyph for 9 (.notdef) in font Times-Bold
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
AVERTISSEMENT: No Unicode mapping for .notdef (9) in font Helvetica
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
AVERTISSEMENT: No glyph for 9 (.notdef) in font Helvetica

Did I miss something in the code, or should I report a bug?

Evvie answered 1/2, 2017 at 10:35 Comment(6)
Yes, that seems to be the root of the problem: it's a problem with the font mapping, which would explain why some parts are missing... Maybe Tilman knows how to fix that...Foreignism
I found exactly the same problem here, which relates to this bug. So I suppose there is nothing I can do at the moment except use another library for this case.Evvie
Known problem: issues.apache.org/jira/browse/PDFBOX-1752, with no solution. The bug is in JAI. The "No unicode..." warning is irrelevant here; it only matters for text extraction.Robbierobbin
Thanks for the answer. Any idea whether JAI will be updated some day, or whether you still plan to rewrite the JPEG2000 decoder from scratch as you mentioned in the bug report?Evvie
No plans. I see I had that thought in 2014, but I did not follow up on it (although GSoC 2014 and 2015 went very well).Robbierobbin
Is there now a solution for this problem? I have some PDFs that render perfectly in Java but attempting to render a page to a BufferedImage gives me just a whole black image (with small colored rectangles where references are linked in the text).Petcock

This is a long-standing problem (see PDFBOX-1752). The bug is in JAI, not in PDFBox. The "No unicode..." warnings are irrelevant here; they only matter for text extraction.

Check out the jai-imageio-jpeg2000 project, then change the file StdEntropyDecoder.java as in this commit (expanded from this pull request). Build the project and either reference version 1.3.1-SNAPSHOT in your Maven pom.xml or copy the jar file onto your classpath.

If the jai-imageio-jpeg2000 project team releases a new version that contains that pull request, then you'll no longer have to build yourself.
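After swapping in a self-built jar, it can help to confirm which JPEG2000 reader the JVM actually picks up. The following is only a sketch: it assumes PDFBox looks up the decoder through ImageIO under the format name "JPEG2000" (which matches the error message quoted in the question), and the class name Jpeg2000ReaderCheck and its helper describeReaders are made up for illustration.

```java
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper, not part of PDFBox or jai-imageio.
final class Jpeg2000ReaderCheck {

    // Lists every ImageIO reader registered for the given format name,
    // together with the jar (or other location) its class was loaded from.
    static List<String> describeReaders(String formatName) {
        List<String> result = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            ImageReader reader = readers.next();
            CodeSource source = reader.getClass().getProtectionDomain().getCodeSource();
            // Classes shipped with the JDK may have no code source at all.
            String location = (source == null || source.getLocation() == null)
                    ? "(no code source, probably JDK built-in)"
                    : source.getLocation().toString();
            result.add(reader.getClass().getName() + " from " + location);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: "JPEG2000" is the format name used for the lookup.
        String formatName = args.length > 0 ? args[0] : "JPEG2000";
        List<String> readers = describeReaders(formatName);
        if (readers.isEmpty()) {
            System.out.println("No ImageIO reader registered for " + formatName);
        } else {
            readers.forEach(System.out::println);
        }
    }
}
```

Run it with the same classpath as your application. If the printed location still points at the 1.3.0 jar, the patched build is not the one being used.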

Additional keywords: black inkblot, black splodge

Robbierobbin answered 1/10, 2018 at 9:59 Comment(2)
For information: the change has since been merged, but version 1.3.1 has still not been released, so you still have to compile the source yourself.Nettienetting
Everybody who reads this: head over to github.com/jai-imageio/jai-imageio-jpeg2000/pull/24, write to @stain, and vote for a release...Foreignism

After 13 reminders, I finally got Stian to release a new version, 1.4.0, of the jai-imageio-jpeg2000 library.

So this problem can finally be solved by upgrading to the latest official release.
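For a Maven build, the upgrade might look like the fragment below. This is a sketch: the group ID com.github.jai-imageio is the one I believe the project publishes under on Maven Central, so check it against the project's README before relying on it.

```xml
<!-- Assumed coordinates; verify against the jai-imageio-jpeg2000 README -->
<dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-jpeg2000</artifactId>
    <version>1.4.0</version>
</dependency>
```

This replaces the 1.3.0 entry from the question; the pdfbox and levigo-jbig2-imageio dependencies stay as they are.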

Foreignism answered 9/12, 2020 at 15:28 Comment(0)
