I'm using Tess4J (a JNA wrapper around Tesseract) and trying to call tess.doOCR(myFile) to OCR text from a single-page PDF.
I have Ghostscript installed (via yum install ghostscript), and gs -h works correctly.
My app server is using a 64-bit JVM, and I have gsdll64.dll and the 64-bit Tesseract DLLs liblept168.dll and libtesseract302.dll on the classpath.
When tess.doOCR(myFile) is called, this is logged:
GPL Ghostscript 8.70 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
But then it just stops there. The program doesn't go any further.
UPDATE --
It looks like the real issue is from this error:
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path
After looking around a lot, I don't see a convenient place to get this libtesseract.so file, and I'm not sure what it takes to get it onto my Linux app server. I read that I may need to download some C++ runtime, but I don't see a Linux download for that. Any advice would be much appreciated.
Or is this something to do with a symbolic link?
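For reference, here is a rough diagnostic sketch (not part of Tess4J) that I could run on the server to see which file name the JVM expects for a library called "tesseract" and whether it exists in the configured search directories. The class name NativeLibCheck and the helper findLibrary are made up for illustration. It assumes the library would be found under the exact name returned by System.mapLibraryName, so a versioned file such as libtesseract.so.3 would not be reported by this check.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tess4J or JNA.
class NativeLibCheck {

    // Returns every file named like the platform's mapping of baseName
    // (e.g. "libtesseract.so" on Linux) found in the directories listed in
    // searchPath, which is a File.pathSeparator-delimited string.
    static List<Path> findLibrary(String baseName, String searchPath) {
        List<Path> hits = new ArrayList<>();
        if (searchPath == null || searchPath.isEmpty()) {
            return hits;
        }
        String fileName = System.mapLibraryName(baseName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, fileName);
            // isRegularFile follows symbolic links by default, so a symlink
            // pointing at a real library file counts as a hit.
            if (Files.isRegularFile(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("os.name = " + System.getProperty("os.name"));
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("expected file = " + System.mapLibraryName("tesseract"));
        // jna.library.path is the system property JNA consults for extra
        // search directories; java.library.path is the JVM's own setting.
        for (String prop : new String[] {"jna.library.path", "java.library.path"}) {
            System.out.println(prop + " -> "
                    + findLibrary("tesseract", System.getProperty(prop)));
        }
    }
}
```

If both lists come back empty on the Linux server, that would be consistent with the UnsatisfiedLinkError above: the .dll files on the classpath are Windows binaries and would not satisfy a lookup for a .so file.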
yum package manager (on some kind of RedHat or something), and tesseract-ocr was not a convenient download. Recalling, it was a nightmare to get it to work without having it available through package management. I definitely think switching to Ubuntu or something Debian-based (with apt-get) makes life a lot easier to get tesseract working... – Inflated