Tesseract running error
R

22

132

I have a problem running the tesseract-ocr engine on Linux. I downloaded the RUS language data and put it in the tessdata directory (/usr/local/share/tessdata). When I run tesseract with the command tesseract blob.jpg out -l rus, it displays an error:

Error opening data file /usr/local/share/tessdata/eng.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.

Failed loading language eng
Tesseract couldn't load any languages!

Could not initialize tesseract.

Following the compiling guide, I used export TESSDATA_PREFIX='/usr/local/share/' to point to the parent of my tessdata directory. Do I need to edit any config files? Tesseract tries to load the 'eng' data files instead of 'rus'.

Screenshot: https://i.stack.imgur.com/I0Guc.png

Rozalin answered 10/2, 2013 at 17:53 Comment(0)
F
135

You can grab eng.traineddata from GitHub:

wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.

When you grab the file(s), move them to the /usr/local/share/tessdata folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.

# If you got the data from Google, unzip it first!
gunzip eng.traineddata.gz 
# Move the data
sudo mv -v eng.traineddata /usr/local/share/tessdata/
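If you would rather script the download than run wget by hand, here is a minimal Python sketch of the same steps. The function names are made up for illustration, and it assumes the "main" branch URL from the wget command above is still valid. Writing into a system directory such as /usr/local/share/tessdata will usually need root permissions.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: the "main" branch of tesseract-ocr/tessdata, as in the wget command above.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang: str) -> str:
    """Return the raw download URL for one language file."""
    return f"{BASE_URL}/{lang}.traineddata"


def install_traineddata(lang: str, tessdata_dir: str) -> Path:
    """Download <lang>.traineddata into tessdata_dir and return its path."""
    target = Path(tessdata_dir) / f"{lang}.traineddata"
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(traineddata_url(lang)) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

For the question above, that would mean calling install_traineddata("eng", ...) and install_traineddata("rus", ...) with whichever tessdata directory your installation uses.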
Foreboding answered 2/4, 2014 at 4:58 Comment(11)
Correct me if I'm wrong, but wasn't the question about including a new language (rus) and not about supplying the one that the (faulty) error message points to? - Ruwenzori
Edit: For some reason, tesseract won't run unless eng.traineddata is present, even if it is not needed. So AAAfarmclub's answer is fine. - Ruwenzori
Warning: other Linux installations (Ubuntu Vivid) use a different directory: /usr/share/tesseract-ocr/tessdata instead. - Roubaix
@Gazta: Yes, openSUSE requires that directory instead. - Hispania
In Ubuntu-Gnome 16.04 it's /usr/share/tesseract-ocr/tessdata/ - Plast
You can put your traineddata anywhere on your system; just make sure the path you set goes all the way to tessdata. Say you downloaded your traineddata to /home/user/Downloads/tessdata/eng.traineddata; then TESSDATA_PREFIX should be /home/user/Downloads/tessdata/ - Vise
Arch Linux: /usr/share/tessdata/. The *.traineddata files can be installed by means of pacman. I just hadn't found the path, but now it's OK. - Kingston
In Ubuntu 18.04 the file is expected in /usr/local/share/tessdata/ by default: sudo mv -v eng.traineddata /usr/local/share/tessdata/ - Anglice
Thanks, it helped me a lot. Now in 2022, the correct download URL is: wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata and the directory for Ubuntu 20.04 is: sudo mv -v eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata - Marko
Ubuntu 22: sudo apt-get install libtesseract-dev tesseract-ocr-eng - Zebulen
If you installed tesseract through Homebrew on a Mac, consider: /opt/homebrew/share/tessdata - Lyndes
H
104

The simplest way is to install the needed package:

sudo apt-get install tesseract-ocr-eng  #for english
sudo apt-get install tesseract-ocr-tam  #for tamil
sudo apt-get install tesseract-ocr-deu  #for deutsch (German)

As you can see, the same pattern works for other languages (e.g. tesseract-ocr-fra).
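To confirm that a language package actually landed where tesseract looks, you can check the output of tesseract --list-langs from a script. This is a rough Python sketch: the function names are invented, and it assumes the output starts with a "List of available languages ..." header line followed by one language code per line.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the only non-language line is a header starting with "List of".
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages() -> list[str]:
    """Run tesseract and return the languages it reports."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some versions print the list on stderr, so both streams are scanned.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

If "rus" is missing from installed_languages() after installing the package, the data directory tesseract is using is probably not the one the package wrote to.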

Hypocrite answered 30/3, 2016 at 12:49 Comment(7)
This should be the accepted answer. Manually tinkering with files behind the package manager's back (assuming you used one to install tesseract in the first place) is a bad idea. - Prognosis
For Mac users using MacPorts: sudo port install tesseract-eng - Sagittal
You can use tesseract --list-langs to see all available languages. You can also use sudo apt-get install tesseract-ocr-* to install all of them. - Visby
For Arch, the packages are called tesseract-data-eng, etc. - Expressionism
Unfortunately, this answer doesn't work in a Docker container, although it works outside of it. - Aric
Trying this command but getting an error: sudo: apt-get: command not found - Griskin
These packages are the fast models, not the best ones. For the best models, download from github.com/tesseract-ocr/tessdata_best and put the file in the tessdata dir of the installation. This also works in Docker, e.g. with python:3.10-slim-bookworm: RUN cd /usr/share/tesseract-ocr/5/tessdata && curl -O https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/eng.traineddata - Whisenhunt
H
45

I had this error too, on a Windows machine.

My solution:

1) Download your language files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00

For example, for eng, I downloaded all files with eng prefix.

2) Put them into a tessdata directory inside some folder. Add that folder to the system environment variables as TESSDATA_PREFIX.

The result is a system environment variable TESSDATA_PREFIX=D:/Java/OCR, and the OCR folder contains tessdata with the language files.

[Screenshot of the directory]
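The error messages quoted on this page disagree about what TESSDATA_PREFIX should point to: one says the parent directory of tessdata, another says the tessdata directory itself. A quick way to see which layout you have is to look for the file under both. A small Python sketch (find_traineddata is a made-up helper, not part of any Tesseract tooling):

```python
from pathlib import Path
from typing import Optional


def find_traineddata(prefix: str, lang: str) -> Optional[Path]:
    """Return the path of <lang>.traineddata under prefix, or None if it is missing."""
    base = Path(prefix)
    candidates = [
        base / "tessdata" / f"{lang}.traineddata",  # prefix is the parent of tessdata
        base / f"{lang}.traineddata",               # prefix is the tessdata directory itself
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

With the setup described in this answer, find_traineddata("D:/Java/OCR", "eng") should return the file inside D:/Java/OCR/tessdata. If it returns None, the variable points somewhere the file is not.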

Heydon answered 10/9, 2017 at 20:15 Comment(1)
Yes, everybody is talking about Linux. Please, guys, don't forget there is one more popular OS on the market. - Pilocarpine
C
7

For me the problem was in how I downloaded the trained data files. Make sure you use the raw link.

Initially I was using:

wget https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata

When I changed it to:

wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

it worked.
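The difference between the two commands is only the /blob/ versus /raw/ path segment: the blob URL returns the GitHub web page, not the model file. Here is a small Python sketch that rewrites such a URL and checks whether an already downloaded file is really a web page. Both function names are hypothetical, and the HTML check is only a heuristic.

```python
def to_raw_url(url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching 'raw' file URL."""
    return url.replace("/blob/", "/raw/", 1)


def looks_like_html(path: str) -> bool:
    """Heuristic: True if the file starts like an HTML page rather than binary model data."""
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

If looks_like_html("eng.traineddata") is true, delete the file and download it again from the raw URL.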

Charlena answered 29/7, 2021 at 15:18 Comment(2)
That will not work. You need to replace master with main in your URL. - Eroto
This fixed my issues! Thanks! Damn, I had not realized I had downloaded the JSON blob instead of the actual model... - Hereditament
A
5

No previous solution worked for me.

I installed both via apt-get and by manually downloading the tessdata, moved files around /usr and so on, and nothing worked even though I exported the variable a thousand times.

Finally, on a last try before starting to cry, I passed the path directly to the Tesseract() instance.

In Python: tr = Tesseract("/usr/local/share/tesseract-ocr/") and now it works. To clarify, I'm using the tesserwrap module.

Adrianadriana answered 8/8, 2016 at 5:21 Comment(1)
I feel you! I am exactly there right now. The difference is that, to make things worse, I am trying to get it working from the command line. - Ama
A
5
tesseract  --tessdata-dir <tessdata-folder> <image-path> stdout --oem 2 -l <lng>

In my case, these were the mistakes I made, or the attempts that were not successful:

  • I cloned the github repo and copied files from there to
    • /usr/local/share/tessdata/
    • /usr/share/tesseract-ocr/tessdata/
    • /usr/share/tessdata/
  • Used TESSDATA_PREFIX with above paths
  • sudo apt-get install tesseract-ocr-eng

The first 2 attempts did not work because the files from git clone did not work, for reasons I do not know. I am not sure why attempt #3 worked for me.

Finally,

  1. I downloaded the eng.traineddata file using wget
  2. Copied it to some directory
  3. Used --tessdata-dir with directory name

The takeaway for me is to learn the tool well and make use of it, rather than relying on package manager installation and directories.
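The same --tessdata-dir call can be made from a script. Below is a minimal Python sketch using subprocess. The function names are invented, the argument order follows the command at the top of this answer, and --oem is left out so tesseract uses its default engine mode.

```python
import subprocess


def build_command(image: str, lang: str, tessdata_dir: str) -> list[str]:
    """Build a tesseract command that reads models from tessdata_dir and prints to stdout."""
    return ["tesseract", "--tessdata-dir", tessdata_dir, image, "stdout", "-l", lang]


def run_ocr(image: str, lang: str, tessdata_dir: str) -> str:
    """Run tesseract and return the recognized text; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_command(image, lang, tessdata_dir),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the directory per call like this avoids depending on TESSDATA_PREFIX or on whichever default path the installed build was compiled with.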

Ama answered 28/11, 2018 at 9:56 Comment(0)
Y
5

On Ubuntu, just run the command below and the environment variable error will disappear.

command:

export TESSDATA_PREFIX=Path_of_your_tessdata_folder

Command Example:

export TESSDATA_PREFIX=/home/amar/Desktop/OCR/tesseract-4.1.1/tessdata

This command will set the tessdata folder's path to the environment variable with name TESSDATA_PREFIX and the above error will be resolved.
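Keep in mind that export only affects the current shell session and the processes started from it. If tesseract is launched from a Python script, the variable can be set for just that child process instead. A small sketch (env_with_tessdata is a hypothetical helper name):

```python
import os
from typing import Mapping, Optional


def env_with_tessdata(tessdata_dir: str, base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment with TESSDATA_PREFIX set to tessdata_dir."""
    env = dict(os.environ if base is None else base)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env
```

The result can be passed as the env argument of subprocess.run, for example subprocess.run(["tesseract", "blob.jpg", "out", "-l", "rus"], env=env_with_tessdata("/home/amar/Desktop/OCR/tesseract-4.1.1/tessdata")).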

Yanirayank answered 6/4, 2021 at 8:32 Comment(0)
B
4

For Windows Users:

In Environment Variables, add a new system variable named "TESSDATA_PREFIX" with the value "C:\Program Files (x86)\Tesseract-OCR\tessdata".

Brady answered 26/3, 2020 at 5:13 Comment(0)
C
3

In Google Colab I resolved the issue in this way:

!sudo apt-get install tesseract-ocr-*

If you use !sudo apt install tesseract-ocr instead, only 2 languages are installed, so when you intend to work with non-English languages, use the former command. Afterwards, run !pip install pytesseract. You can also check the installed languages with !tesseract --list-langs.

Chimene answered 12/3, 2022 at 7:43 Comment(1)
This should be the proper answer: install the languages via the package manager rather than manually downloading and moving the trained data to the repo. I tried the manual route on Ubuntu 22.04 and was still getting the same message, even though the file was there with the proper permissions and the list-langs command was outputting the language I was trying to use. - Slab
S
2

You can call the Tesseract API from C++ code:

#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h> // ETEXT_DESC

using namespace tesseract;

class TessAPI : public TessBaseAPI {
    public:
    void PrintRects(int len);
};

...
TessAPI *api = new TessAPI();
int res = api->Init(NULL, "rus");
api->SetAccuracyVSpeed(AVS_MOST_ACCURATE);
api->SetImage(data, w0, h0, bpp, stride);
api->SetRectangle(x0,y0,w0,h0);

char *text;
ETEXT_DESC monitor;
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
printf("m.count: %d\n", monitor.count);        // count and progress are integers, not strings
printf("m.progress: %d\n", monitor.progress);
delete [] text;                                // GetUTF8Text() returns a buffer the caller must free

api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
delete [] text;
...
api->End();

And build this code:

g++ -g -I. -I/usr/local/include -o _test test.cpp -ltesseract_api -lfreeimageplus

(I need FreeImage for picture loading.)

Stanislas answered 13/2, 2013 at 12:32 Comment(1)
@DarkSkull, yes, this is C++ code tested on Debian GNU/Linux. As you can see, Russel Crowe has the problem with the function TessAPI::Init(NULL, "rus"). It is worth inspecting the Tesseract source code (the TessAPI class method). - Stanislas
D
2

I'm using Windows. I tried all the solutions above and none of them worked.

Finally, I installed Tesseract-OCR on the D drive (where I run my Python script from) instead of the C drive, and it worked.

So, if you are using Windows, run your Python script from the same drive as your Tesseract-OCR installation.

Drear answered 18/4, 2019 at 22:4 Comment(0)
K
1

I'm using Visual Studio 2017 Community Edition.
I solved this problem by making a directory called tessdata in the Debug directory of my project. Then I put the eng.traineddata file into said directory.

Kaka answered 27/12, 2017 at 20:3 Comment(0)
M
1

C# developer working on Windows here. What works for me is simply to download the file eng.traineddata from the following URL:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

and copy it to the following directory in my Console Application project:

[Project Directory]\bin\Debug\tessdata

I did manually create the tessdata folder above.

Mele answered 8/5, 2020 at 13:49 Comment(0)
E
1

I had the same problem with DEU language on macOS. I could solve it by installing all additional languages like so:

brew install tesseract-lang

as suggested on https://formulae.brew.sh/formula/tesseract

Exorcise answered 11/9, 2021 at 22:38 Comment(0)
S
0
tessdata_dir_config = r'--tessdata-dir "/usr/local/Cellar/tesseract/4.1.1/share/tessdata"'
pytesseract.image_to_string(imgCrop,lang='eng',config=tessdata_dir_config)
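For context, the snippet above passes a --tessdata-dir option to tesseract through pytesseract's config argument, so the language files are read from that Homebrew directory instead of the default location. Here is a small Python sketch of building that string for an arbitrary directory. tessdata_config is a made-up helper, and the pytesseract call is left as a comment so the block runs without the package installed.

```python
def tessdata_config(tessdata_dir: str) -> str:
    """Build the config string that points tesseract at tessdata_dir."""
    # The double quotes keep a path containing spaces together as one argument.
    return f'--tessdata-dir "{tessdata_dir}"'


# Usage with pytesseract (assumes pytesseract is installed and `image` is already loaded):
# text = pytesseract.image_to_string(image, lang="eng", config=tessdata_config("/path/to/tessdata"))
```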
Scantling answered 20/10, 2020 at 21:38 Comment(1)
This isn't an answer. Please explain what is going on here so others can gain useful insights when reading your response. - Seismograph
F
0

Add this to your code:

instance.setDatapath("C:\\somepath\\tessdata");

instance.setLanguage("eng");
Filibeg answered 17/1, 2021 at 20:56 Comment(0)
S
0

How I solved the problem in my Manjaro Xfce:

Message “TesseractError: (1, 'Error opening data file /home/julio/snap/tesseract/common/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')”

Then, on Manjaro, I ran sudo pacman -S tesseract. The system installed both “tesseract” and a package named “leptonica”.

After this step, I thought everything was OK and tried to run my simple script. However, the error message changed to something like this (the previous “/home” location became a “/usr”-like location): “Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')”

Then I realized that this message had appeared when I installed “tesseract” with pacman: “You must install one of tesseract-data-* packages or whole tesseract-data group”

So I tried the command “sudo pacman -S tesseract-data”, and the system presented lots of language options. I chose some languages and installed them as follows, and the module started to work like a charm:

sudo pacman -S tesseract-data-eng

sudo pacman -S tesseract-data-por

sudo pacman -S tesseract-data-fra

sudo pacman -S tesseract-data-spa

I tried some Portuguese special characters (like "ão"), which only worked when I passed the argument lang='por', as in pytesseract.image_to_string(img, lang='por').

Shush answered 19/3, 2021 at 21:25 Comment(0)
H
0

As of 2021, my solution for Ubuntu is to download the zip files from https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0, extract them, and copy the necessary .traineddata files into /usr/local/share/tessdata. This is the default folder where tesseract 4.1.1 searches for trained data.

Hallerson answered 18/5, 2021 at 8:3 Comment(1)
@innovationism By default the files are located in /usr/share/tesseract-ocr/4.00/tessdata/ on Ubuntu 20.04, with Tesseract version 4.1.1. When I place tessdata_best files there, tesseract throws an error. How did you resolve this issue, in case you faced it? - Stocks
T
0

If you are on Windows, add your Tesseract-OCR directory to the system environment variables. For example:

  1. Find the path where Tesseract is installed on your C drive (in my case r"C:\Program Files\Tesseract-OCR\tesseract.exe").
  2. Make sure you have the required data files (tessdata); if not, download them from https://github.com/tesseract-ocr/tessdata https://github.com/tesseract-ocr/langdata (at least for the languages you want to convert).
  3. Paste them into the main directory, in my case C:\Program Files\Tesseract-OCR.
  4. Add the path of that directory to your system environment variables: search for "environment variable" in the Start bar, open Environment Variables, select Path under the system variables (NOT the user variables), and paste the Tesseract-OCR path.

That's all.

Trichomonad answered 4/1, 2022 at 11:2 Comment(0)
B
0

On Windows, all you need to do is:

  1. Download the language you want.
  2. Store it in any folder, preferably one named tessdata.
  3. Set a system (not user) environment variable in Advanced System Settings: variable name TESSDATA_PREFIX, variable value the folder path (e.g. C:\Userseclipse-workspace\tessdata).
  4. Restart your laptop.
Brashy answered 5/3, 2023 at 10:0 Comment(0)
F
0

On macOS, installed via macports:

After installing tesseract run sudo port install tesseract-osd to enable the 'osd'. This solved the error for me.

Fetlock answered 22/11, 2023 at 5:33 Comment(0)
M
0

Just adding some options to the list -- on RHEL and related systems, sudo dnf [or yum] install tesseract installs tesseract-langpack-eng by default. It looks like others may have to be downloaded manually.

Mauldin answered 8/12, 2023 at 20:33 Comment(0)
