Unable to extract scanned pdf using TesseractOCRConfig Apache Tika

Asked 29/9, 2016 at 6:23 Answered 30/9, 2016 at 13:9

java parsing pdf ocr apache-tika

My PDF contains scanned images and I want to extract the text from it.

What I tried: AutoDetectParser, but I got no output.

I followed the solution in "Apache Tika extract scanned PDF files" and the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I still get an empty string, without any error.

My configuration: Windows 7 64-bit, JDK 1.8.0_45.

Any kind of help is welcome.

Geniegenii asked 29/9, 2016 at 6:23 Comment(9)

Do you have Tesseract installed and at the location given in your config? Did you try following the Tika Troubleshooting Guide? – Popish 29/9, 2016 at 8:54

@Popish I am using Maven to pull in all the jars, which includes Tesseract. I have looked at the Troubleshooting Guide section on the "No Content Extracted" problem. I used the most recent version (1.13) of the Apache Tika app jar and tried the GUI to check the extraction, but got no output. – Geniegenii 29/9, 2016 at 11:50

Tesseract is not a Java library, so Maven won't help you. You need to download and install the native program for your operating system. – Popish 29/9, 2016 at 12:17

@Popish I don't want to use any external software for this. I want to use a Tesseract OCR Java API that can be used inside my Java application. Anyway, just for fun, I installed the Tesseract desktop app and tried my PDF; it extracts some incorrect words. – Geniegenii 29/9, 2016 at 12:35

Tesseract is a native program that you have to download and install separately. All Tika ships is the wrapper code around Tesseract, which lets Tika use it if it is installed. – Popish 29/9, 2016 at 12:58

@Popish Can I extract text from a scanned PDF without installing any native program on my system? If not, my Java application will depend on a native program at runtime, which I want to avoid. – Geniegenii 29/9, 2016 at 13:7

Try softwarerecs.stackexchange.com – Popish 29/9, 2016 at 14:0

@Popish Thanks for your help. I have installed Tesseract and tried to run it from Tika using new TesseractOCRConfig().setTesseractPath(tesseractFolder);. I can extract text from images and from PDFs containing a single image, but not from PDFs containing multiple images. I get no error, but also no output. – Geniegenii 30/9, 2016 at 9:45

This link helped me solve the issue. The issue was: Tika dropped support for extracting TIFF images from PDFs in 1.13, and for that we need to add one more dependency:

<dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-core</artifactId>
    <version>1.3.1</version>
</dependency>

Thanks. – Geniegenii 30/9, 2016 at 12:40

Steps to solve this:

Install Tesseract on your system. For Windows, use 'tesseract-ocr-setup-3.05.00dev.exe' from https://sourceforge.net/projects/tesseract-ocr-alt/files/. Then set its install location in your config.
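Before passing the folder to Tika, it can help to confirm that it really contains the Tesseract executable, since a wrong path tends to show up only as empty output. The following is a minimal sketch using only the JDK; the class name, the helper hasTesseract and the default folder in main are hypothetical and not part of Tika or Tesseract.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class TesseractCheck {

    /**
     * Returns true if the given folder contains a file named
     * tesseract.exe (Windows) or tesseract (other systems).
     */
    static boolean hasTesseract(File folder) {
        List<String> names = Arrays.asList("tesseract.exe", "tesseract");
        for (String name : names) {
            File candidate = new File(folder, name);
            if (candidate.isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical default; pass your real install folder as the first argument.
        String path = args.length > 0 ? args[0] : "C:\\Program Files (x86)\\Tesseract-OCR";
        File folder = new File(path);
        System.out.println(folder + " contains tesseract: " + hasTesseract(folder));
    }
}
```

If this prints false for the folder you pass to setTesseractPath, fix the path first before looking at the PDF settings.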

Java code:

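import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath(tPath); // tPath: folder where Tesseract is installed
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true); // hand embedded images to the OCR parser
pdfConfig.setExtractUniqueInlineImagesOnly(false); // set to false if the PDF contains multiple images
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Needed so that embedded images are parsed recursively
parseContext.set(Parser.class, parser);
// Assumed final step (not in the original snippet): stream is an InputStream over the PDF
parser.parse(stream, handler, new Metadata(), parseContext);
String text = handler.toString();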

Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
    </dependency>
</dependencies>
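To see whether the two extra image libraries are actually picked up at runtime, one option is to ask the JDK's ImageIO registry which formats it can read. This is only a diagnostic sketch, not part of the fix itself; the class name and the helper hasReader are hypothetical, and the format names "tiff" and "jbig2" are my assumption about what these plugins register.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

class ImageReaderCheck {

    /** Returns true if ImageIO has at least one reader registered for the format name. */
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Pick up any ImageIO plugins that are on the classpath.
        ImageIO.scanForPlugins();
        // "tiff" and "jbig2" are assumed format names; "png" ships with the JDK.
        for (String format : new String[] {"tiff", "jbig2", "png"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
    }
}
```

Run it with the same classpath as your application. If a format shows as unavailable on the JDK you use, images in that format inside the PDF probably cannot be decoded, so nothing reaches Tesseract.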

I think it may be helpful. Thanks.

Geniegenii answered 30/9, 2016 at 13:9 Comment(1)

Thank you for this. Beware of the licensing implications of using levigo and jai. If they were Apache 2.0 compatible, we would have embedded them. – Thrill 27/3, 2017 at 16:43
