Unable to extract scanned pdf using TesseractOCRConfig Apache Tika
My PDF contains scanned images and I want to extract the text from it.

What I tried: I used AutoDetectParser, but it produced no output.

I followed the solution provided in "Apache Tika extract scanned PDF files" and also the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I get an empty string without any error.

My configuration: Windows 7 64-bit, JDK 1.8.0_45.

Any kind of help is welcome.

Geniegenii answered 29/9, 2016 at 6:23 Comment(9)
Do you have Tesseract installed and at the location given in your config? Did you try following the Tika Troubleshooting Guide? - Popish
@Popish I am using Maven to pull in all the jars, which includes Tesseract. I have looked at the Troubleshooting Guide for the "No Content Extracted" problem. I used the most recent version (1.13) of the Apache-tika-app.jar and tried its GUI to check the extraction, but got no output. - Geniegenii
Tesseract is not a Java library, so Maven won't help you. You need to download and install the native program for your operating system. - Popish
@Popish I don't want to rely on any separately installed software for this. I want to use a Tesseract OCR Java API that can be used inside my Java application. Anyway, just for fun, I installed the Tesseract desktop app and tried my PDF; it extracts some incorrect words. - Geniegenii
Tesseract is a native program you have to download and install separately. All Tika ships is the appropriate wrappers around Tesseract to enable it to be used if installed. - Popish
@Popish Can I extract text from a scanned PDF without installing any native program on my system? If not, my Java application will depend on a native program, which I want to avoid. - Geniegenii
Try softwarerecs.stackexchange.com - Popish
@Popish Thanks for your help. I have installed Tesseract and tried to run it from Tika using new TesseractOCRConfig().setTesseractPath(tesseractFolder);. I can easily extract text from images and from PDFs containing a single image, but not from PDFs that contain multiple images. I get no error, but also no output. - Geniegenii
This link helped me solve the issue. The issue was: Tika dropped support for extracting TIFF images from PDFs in 1.13, so we need to add one more dependency: <dependency> <groupId>com.github.jai-imageio</groupId> <artifactId>jai-imageio-core</artifactId> <version>1.3.1</version> </dependency>. Thanks. - Geniegenii

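Steps to solve this:

  1. Install Tesseract on your system, using 'tesseract-ocr-setup-3.05.00dev.exe' for Windows from https://sourceforge.net/projects/tesseract-ocr-alt/files/, and set its location in your config.

    Java code:

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    TesseractOCRConfig config = new TesseractOCRConfig();
    config.setTesseractPath(tPath); // tPath: folder where Tesseract is installed
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(false); // set to false if the PDF contains multiple images
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    // needed so that embedded images are parsed recursively
    parseContext.set(Parser.class, parser);
    // 'stream' is assumed to be an InputStream opened on the PDF
    parser.parse(stream, handler, new Metadata(), parseContext);
    String text = handler.toString();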
    
  2. Maven dependencies:

    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
      </dependency>
      <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
      </dependency>
      <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
      </dependency>
    </dependencies>
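
Since a wrong Tesseract location seems to result in empty output rather than an error, it may be worth checking the folder before passing it to setTesseractPath. The following is only a sketch using the plain JDK: the class TesseractCheck and the method hasTesseract are hypothetical helpers (not part of Tika), and the executable names "tesseract" and "tesseract.exe" are assumptions about a typical install.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TesseractCheck {

    /**
     * Hypothetical helper: returns true if the given folder contains a regular,
     * executable file named "tesseract" or "tesseract.exe" (assumed names).
     */
    public static boolean hasTesseract(String folder) {
        if (folder == null || folder.isEmpty()) {
            return false;
        }
        try {
            Path dir = Paths.get(folder);
            return isExecutableFile(dir.resolve("tesseract"))
                    || isExecutableFile(dir.resolve("tesseract.exe"));
        } catch (InvalidPathException e) {
            // The string is not a valid path on this platform.
            return false;
        }
    }

    // A directory can also carry the execute bit, so require a regular file.
    private static boolean isExecutableFile(Path p) {
        return Files.isRegularFile(p) && Files.isExecutable(p);
    }

    public static void main(String[] args) {
        // Pass the same folder you would give to setTesseractPath.
        String tPath = args.length > 0 ? args[0] : "";
        System.out.println("Tesseract found: " + hasTesseract(tPath));
    }
}
```

If this prints false for the folder you configured, fix the path (or the installation) before looking for problems in the Tika configuration.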

I hope this helps. Thanks.

Geniegenii answered 30/9, 2016 at 13:9 Comment(1)
Thank you for this. Beware of the licensing implications of using levigo and jai. If they were Apache 2.0 compatible, we would have embedded them. - Thrill
