Apache Tika: extract text from scanned PDF files

I'm having some trouble with Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, so each page is only an image. My goal is to extract the text from these PDF files anyway.

Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):

public String extractText(InputStream stream) {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    String text = handler.toString();
    return text;
}

I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but it didn't change anything. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a DOC file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

Metalepsis answered 2/9, 2015 at 13:13 Comment(11)
Metaphrase: Did you attach a PDFParserConfig to the context with that option set?
Metalepsis: Yes, I did. But it had no effect :/
Metaphrase: Can you post the code you used to do that, so we can check if it's correct?
Metalepsis:
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, config);
    PDFParser pdfParser = new PDFParser();
    pdfParser.setPDFParserConfig(config);
    pdfParser.parse(stream, handler, metadata, context);
There you go, thanks for the help so far :)
Metaphrase: Does running the Tika App with the -z (extract) flag get the scanned images out of the file?
Metalepsis: Sadly, it doesn't. BTW: I'm using the PDF mentioned in the Tika ticket about OCR of embedded images, which you can find here: Ticket, PDF
Metaphrase: I'd suggest you raise a new Tika JIRA then, and refer to that file + what you've tried + a unit test that shows the issue. You seem to have done everything that I'd expect you to need to have done!
Metalepsis: I created a ticket in the official Apache Tika JIRA. Anyone interested in updates can take a look here.
Edgaredgard: Does it work for you without Tesseract installed?
Metalepsis: No, it needs Tesseract.
Edgaredgard: It would be better to write your solution here so everybody can use it.

Tim Allison provided the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)
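
Why the Parser.class entry matters can be illustrated with a toy model. This is only a sketch, not Tika's real classes: ToyParser, ToyContext, EMPTY and AUTO are hypothetical names, and the "pdf:" and "img:" prefixes are made up. It assumes, as I understand Tika's behavior, that the code handling embedded documents looks up a parser in the context and falls back to a parser that extracts nothing when none is registered.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {
    /** Hypothetical stand-in for a parser: turns a document into text. */
    interface ToyParser {
        String parse(String document, ToyContext context);
    }

    /** Hypothetical type-keyed settings bag, loosely modeled on a parse context. */
    static class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Fallback used when no parser was registered: extracts nothing. */
    static final ToyParser EMPTY = (document, context) -> "";

    /** Handles "img:" documents itself and delegates the content of "pdf:" containers. */
    static final ToyParser AUTO = (document, context) -> {
        if (document.startsWith("img:")) {
            return document.substring("img:".length()); // pretend this is OCR output
        }
        // The container has no text of its own; the embedded image is parsed
        // by whatever parser the context holds, or by EMPTY if there is none.
        String embedded = document.substring("pdf:".length());
        ToyParser delegate = context.get(ToyParser.class, EMPTY);
        return delegate.parse(embedded, context);
    };

    /** Parses a scanned "PDF" with or without a parser registered in the context. */
    public static String extract(boolean registerParser) {
        ToyContext context = new ToyContext();
        if (registerParser) {
            context.set(ToyParser.class, AUTO);
        }
        return AUTO.parse("pdf:img:hello", context);
    }

    public static void main(String[] args) {
        System.out.println("without parser in context: '" + extract(false) + "'"); // ''
        System.out.println("with parser in context:    '" + extract(true) + "'");  // 'hello'
    }
}
```

In this model the outer parse succeeds either way, which matches the symptom in the question: no error, just empty output for the embedded images.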

EDIT: Here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing throws
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

Maven Dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>
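
Since the OCR step needs Tesseract, it can help to confirm that the JVM can actually start the tesseract binary before debugging the Tika side. The following is a minimal sketch using only the JDK (Java 9+ for Redirect.DISCARD). The class name TesseractCheck and the helper canRun are hypothetical, and it assumes the executable is called tesseract and is on the PATH of the Java process.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TesseractCheck {
    /** Returns true if the command starts and exits with status 0 within 10 seconds. */
    public static boolean canRun(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // output is not needed
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // hung process: treat as unavailable
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false; // typically: the binary was not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the binary is named "tesseract" and is reachable via PATH.
        System.out.println("tesseract available: " + canRun("tesseract", "--version"));
    }
}
```

If this prints false, the image is probably never OCRed at all, which would explain results that look the same with and without Tesseract.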
Metalepsis answered 15/9, 2015 at 11:50 Comment(4)
Stenophyllous: I tried the solution and followed the Apache Tika JIRA ticket, but it's not working. I don't get any error, but the output is empty.
Stenophyllous: My issue got solved. See #39763341
Uvulitis: Thamme, thank you for this. Please update to include the following dependency (thanks to Rana's link above) and a warning about the licensing implications of levigo and jai.
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
    </dependency>
Clown: Hi, I used the code above and found no difference in the extraction result whether I included Tesseract or not. Can you tell me why Tesseract is being used? Thanks in advance.
