Tesseract ocr PDF as input

E

5

33

I am building an OCR project and I am using a .Net wrapper for Tesseract. The samples that the wrapper have don't show how to deal with a PDF as input. Using a PDF as input how do I produce a searchable PDF using c#?

I have use ghostscript library to change Pdf to image then feed Tesseract with it and it's working great getting the text but i doesn't save the original shape of Pdf i only get text

how can i get text from Pdf with saving the shape of original Pdf

this is a page from pdf i don't want only text i want the text to be in the shapes like the original pdf and sorry for poor English

Elisabeth answered 15/4, 2015 at 17:48 Comment(3)

You'd need a library to turn a PDF into an Image. And then use that same library to create the searchable PDF. – Rothermere 15/4, 2015 at 17:50

which library is the best for this job and could you provide me with a sample to how to do this .. and i want to save the shape of the original pdf and add under it the text layer @Rothermere – Elisabeth 15/4, 2015 at 18:42

Removed unnecessary information, linked the outside link in-line and fixed grammar. This question requires 'what you've tried' (in terms of actual code) or it risks being downvoted into oblivion or closed. – Fluellen 15/4, 2015 at 23:3

S

27

Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image pdf.

import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract


def pdf_to_img(pdf_file):
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(file):
    text = pytesseract.image_to_string(file)
    return text


def print_pages(pdf_file):
    images = pdf_to_img(pdf_file)
    for pg, img in enumerate(images):
        print(ocr_core(img))


print_pages('sample.pdf')

Screak answered 11/10, 2019 at 10:52 Comment(1)

keep in mind if you want to change the language you need to only change this line of code: text = pytesseract.image_to_string(file, lang='bul') in my case Bulgarian. Also check this post for more info: #44692329 – Succuss 6/10, 2021 at 8:39

H

21

There is a handy tool OCRmyPDF that will add a text layer to a scanned PDF making it searchable - which essentially automates the steps mentioned in previous answers.

Headpiece answered 6/9, 2021 at 18:54 Comment(0)

E

8

Tesseract supports the creation of sandwich since version 3.0. But 3.02 or 3.03 are recommended for this feature. Pdfsandwich is a script which does more or less what you want.

There is the online service www.sandwichpdf.com which does use tesseract for creating searchable PDFs. You might want to run a few tests before you start implementing your solution with tesseract. The results are ok, but there are commercial products which deliver better results. Disclosure: I am the creator of www.sandwichpdf.com.

Elohim answered 17/4, 2015 at 9:19 Comment(1)

tobltobs thank you did pdfsandwich support windows because i am coding using visual studio 2010 in windows 7 64 bit – Elisabeth 17/4, 2015 at 11:3

D

0

Use pdf2png.com, then upload your pdf, then it will make all png files of each page as <pdf_name>-<page_number>.png in .zip file,

Then, you can write simple python code as

#/usr/bin/python3
#coding:utf-8
import os
pdf_name = 'pdf_name'
language = 'language of tesseract'
for x in range(int('number of pdf_pages')):
    cmd = f'tesseract {pdf_mame}-{x}.png {x} -l {language}'
    os.system(cmd)

And then, read all the files such as from 1.txt to all the way up, and append to a single, file, it is as easy as that.

Debate answered 18/8, 2020 at 9:48 Comment(1)

question is for c# – Braunite 17/10, 2023 at 18:9

F

0

OCR is not able to resolve many situations that occur in a PDF. Especially when the file was already a mix of text and small image components as in this case (5 images spliced in the page).

So the best working result will be to do graphics content OCR in situ. Where we can correct for any distortions such as wrong font height and content. Look closely at the data right and left.

Foulard answered 29/1 at 16:13 Comment(1)

what does "do graphics content OCR in situ" mean? Do you mean convert to image then to pdf like other folks are saying? – Stymie 1/5 at 8:10

Recommended topics

Hot tags