Extract Images from PDF with Apache Tika

2

5

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.

My use case is that I want some code that will extract the content and separately the images from any documents (not necessarily PDFs). This then gets passed into an Apache UIMA pipeline.

I've been able to extract images from other document types by using a custom parser (built on an AutoDetectParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.

Could someone suggest how I might achieve the above, preferably with some code examples of how to do inline image extraction from PDFs with Tika 1.6?

Crutchfield answered 11/9, 2014 at 8:58 Comment(5)
TIKA-1268 and TIKA-1396 were both marked as fixed in 1.6, are you sure you're really using Tika 1.6 for this?Peterec
Assuming that the one marked 1.6 on the website and that is called tika-app-1.6.jar is actually Tika 1.6, then yes I'm sure!Crutchfield
And you're trying the Tika App with the --extract flag to test the image extraction?Peterec
I'm trying to do it programmatically, but I've tried the --extract flag and the GUI and haven't managed to find the images in the document with either method.Crutchfield
Sounds like you need to hop onto one of those bugs then, and flag up that it isn't properly fixed yetPeterec
4

It is possible to extract images with an AutoDetectParser, without using PDFParser directly. The same code also extracts images from docx, pptx, etc.

Here I have two functions, parseDocument() and setPdfConfig(), which make use of an AutoDetectParser.

  1. Create an AutoDetectParser.
  2. Attach an EmbeddedDocumentExtractor to a ParseContext.
  3. Attach the AutoDetectParser to the same ParseContext.
  4. Attach a PDFParserConfig to the same ParseContext.
  5. Pass that ParseContext to AutoDetectParser.parse().

The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.

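import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Enables inline image extraction for PDFs; it is off by default.
private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    // Extract an image only once even if several pages reference it.
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    // The PDF parser picks this config up from the ParseContext.
    context.set(PDFParserConfig.class, pdfConfig);
}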

private static String parseDocument(String path) {
    String xhtmlContents = "";

    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);

            Path outputPath = new File(outputDir.toString() + "/" + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    context.set(AutoDetectParser.class, parser);

    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException | TikaException e) {
        e.printStackTrace();
    }

    return xhtmlContents;
}
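
One caveat with the code above: metadata.get(Metadata.RESOURCE_NAME_KEY) returns null when an embedded resource carries no name, and the value comes from the document itself, so it may contain path separators. Below is a minimal sketch of a helper that picks a safe output file using only the JDK. The class name EmbeddedOutputPaths, the method resolve(), and the embedded-<index>.bin fallback name are my own assumptions for illustration, not part of Tika.

```java
import java.nio.file.Path;

// Hypothetical helper (not part of Tika): chooses a safe file inside the output directory.
final class EmbeddedOutputPaths {
    private EmbeddedOutputPaths() {}

    /**
     * @param outputDir    directory the images are written to
     * @param resourceName value read from the embedded metadata, may be null
     * @param index        caller-maintained counter, used only for the fallback name
     */
    static Path resolve(Path outputDir, String resourceName, int index) {
        String name = resourceName == null ? "" : resourceName;
        // Keep only the last path segment so "../x" cannot escape outputDir.
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        // Replace anything outside a conservative character set.
        name = name.substring(cut + 1).replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "embedded-" + index + ".bin"; // assumed fallback naming scheme
        }
        return outputDir.resolve(name);
    }
}
```

Inside parseEmbedded() you could then build the target with EmbeddedOutputPaths.resolve(outputDir, metadata.get(Metadata.RESOURCE_NAME_KEY), counter++), where counter is a field you add to the extractor yourself.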
Rejection answered 12/8, 2018 at 8:11 Comment(5)
I think you meant AutoDetectParser.class where you specified AutoParser.class?Vanderbilt
@Vanderbilt Fixed!Rejection
I have used your solution on a few pdfs. On one, it found both images and saved them. On another, it only saved one of ten images. Stranger, it wasn't the first one that appeared in the document. Do you have any thoughts on what might have happened? The call to parseEmbedded happens only once in this case.Vanderbilt
@Vanderbilt If you set this line to false, does it help? pdfConfig.setExtractUniqueInlineImagesOnly(true);Rejection
Thanks; I did. It didn't help. (I tinkered with lots of config settings.) There are certain PDFs which, for reasons we cannot determine, cause all kinds of problems, and this is one of them. (The biggest problem is enormous amounts of time extracting the images.) We're just trying to work around the problem. I think the cause is somewhere in the JAI API, and it's far too low-level to consider digging into. But thank you for the code above and your responses.Vanderbilt
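
For PDFs where image extraction takes an enormous amount of time, one possible workaround is to put an upper bound on how long you wait for the parse. This is only a sketch using the JDK's ExecutorService; the class name BoundedTask and the method runWithTimeout() are hypothetical names, and nothing here is Tika API. Note the limitation: cancelling merely interrupts the worker thread, and I can't vouch that the PDF image code reacts to interruption, so a stuck parse may keep consuming CPU in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: waits for a task at most timeoutMillis, then gives up.
final class BoundedTask {
    private BoundedTask() {}

    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-task");
            t.setDaemon(true); // a runaway task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; code ignoring interrupts keeps running
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            // The task itself failed; surface its cause to the caller.
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

From the same class as the answer's code, a call might look like BoundedTask.runWithTimeout(() -> parseDocument(path), 60_000, ""), which returns an empty string if the parse has not finished after a minute.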
3

Try the code below; the returned ContentHandler holds your XML content.

public ContentHandler convertPdf(byte[] content, String path, String filename)
        throws IOException, SAXException, TikaException {

    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();
    PDFParser parser = new PDFParser();

    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);

    parser.setPDFParserConfig(config);


    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
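            // path is concatenated as-is, so it must end with a file separator.
            Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            // Overwrite files left by an earlier run instead of failing with FileAlreadyExistsException.
            Files.copy(stream, outputFile, java.nio.file.StandardCopyOption.REPLACE_EXISTING);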
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }

    return handler;
}
Coursing answered 24/11, 2017 at 11:53 Comment(0)
