How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory of PDF files (scanned images). How can I efficiently extract the text from all the files in the directory? So far I have tried:

import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))

However, it is not working: it takes a very long time (some of my documents have 600 pages). Additionally: a) I do not know how to handle the directory transformation efficiently. b) I would like to add a page separator, say <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.

Thus, how can I apply the extract_txt function to every file in a directory that ends with .pdf, write the results to another directory as .txt files, and add a page separator to the OCR-extracted text?

Also, I was curious about using Google Docs for this task. Is it possible to use Google Docs programmatically to solve this text extraction problem?

UPDATE

Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>): after reading Roland Smith's answer, I tried:

import os

from PyPDF2 import PdfFileWriter, PdfFileReader
import textract


def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =' , i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =' , i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')

However, I still have issues with the print() part: instead of printing, it would be more useful to save all the output to a file. So I tried to redirect the output to a file:

sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname)  # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()

Any idea how to do the page extraction/separator trick and save everything into a file?

Nabors answered 28/4, 2017 at 5:9 Comment(5)
Did all documents fail to extract, or were they only very slow to complete? – Constant
Thanks for the help, @Constant. Both. After 2 hours I got: [None] – Nabors
Then have you tried to OCR a PDF on the command line with Tesseract? – Constant
I tried with a shorter file (33 pages) and still had the same issue. No, could you provide an example of how to do that, @Constant? – Nabors
@johndoe Do not redirect the standard streams! Just open a file and write to it. Read the section on reading and writing files in the Python tutorial. It's not difficult. – Carbonize

In your code, you are extracting the text, but you don't do anything with it.

Try something like this:

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path

This writes the text to a file that has the same name but a .txt extension.

It also returns the path of the original file to let the parent know that this file is done.

So I would change the mapping code to:

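import multiprocessing

# The guard is required on platforms that start workers with "spawn"
# (Windows, and macOS on recent Python versions).
if __name__ == '__main__':
    p = multiprocessing.Pool()
    file_path = ['/Users/user/Desktop/sample.pdf']
    for fn in p.imap_unordered(extract_txt, file_path):
        print('completed file:', fn)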
  • You don't need to give an argument when creating a Pool. By default it will create as many workers as there are CPU cores.
  • Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
  • Because the worker function returned the filename, you can print it to let the user know that this file is done.
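
For the directory part of the question, the same worker idea can be extended to every .pdf in a folder. The following is only a sketch under some assumptions: the helper names plan_outputs, ocr_to_file and convert_directory are made up for illustration, the match on "*.pdf" is case-sensitive on most systems, and textract is imported inside the worker so the planning step can be tried without it.

```python
import multiprocessing
from pathlib import Path


def plan_outputs(src_dir, dst_dir):
    """Pair every *.pdf in src_dir with a .txt path of the same stem in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    return [(str(pdf), str(dst / (pdf.stem + ".txt")))
            for pdf in sorted(src.glob("*.pdf"))]


def ocr_to_file(job):
    """Worker: OCR one PDF and write the text to its paired output path."""
    import textract  # imported lazily; only needed when OCR actually runs

    pdf_path, txt_path = job
    text = textract.process(pdf_path, method="tesseract")  # bytes on Python 3
    with open(txt_path, "wb") as output_file:
        output_file.write(text)
    return pdf_path


def convert_directory(src_dir, dst_dir):
    """OCR all PDFs in src_dir in parallel, writing .txt files into dst_dir."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    jobs = plan_outputs(src_dir, dst_dir)
    with multiprocessing.Pool() as pool:
        # Results arrive as soon as each file is finished, in any order.
        for done in pool.imap_unordered(ocr_to_file, jobs):
            print("completed file:", done)
```

You would call convert_directory(source_folder, target_folder) from under an if __name__ == '__main__': guard. Parallelism here is per file, so a single 600-page document still runs on one worker.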

Edit 1:

The additional question is if it is possible to mark page boundaries. I think it is.

A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
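
As a rough sketch of that approach, the two poppler tools can be driven from Python with subprocess. This assumes pdfinfo and pdfseparate are installed and on the PATH; the function names and the "page-%d.pdf" naming pattern are my own choices for illustration.

```python
import re
import subprocess
from pathlib import Path


def parse_page_count(pdfinfo_output):
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def split_into_pages(pdf_path, out_dir):
    """Split pdf_path into single-page PDFs in out_dir; return paths in page order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    info = subprocess.run(["pdfinfo", pdf_path], check=True,
                          capture_output=True, text=True,
                          errors="replace").stdout
    n_pages = parse_page_count(info)
    # pdfseparate replaces %d in the pattern with the 1-based page number.
    pattern = str(out / "page-%d.pdf")
    subprocess.run(["pdfseparate", pdf_path, pattern], check=True)
    return [str(out / "page-{}.pdf".format(i)) for i in range(1, n_pages + 1)]
```

Each returned path can then be passed to the OCR step on its own, which gives you the text of every page separately.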

Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.


Edit 2:

If you need a file, write a file:

import os

# PyPDF2 < 3.0 API; newer releases rename these to PdfReader/PdfWriter.
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt'  # Assuming PDF file name ends with ".pdf"
    with open(outfname, 'w', encoding='utf8') as textfile:
        for i in range(inputpdf.numPages):
            # Write page i to a temporary single-page PDF.
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            pagefname = 'page{:03d}.pdf'.format(i)
            with open(pagefname, 'wb') as pagefile:
                w.write(pagefile)
            print('page', i)
            # textract returns bytes; decode before combining with str.
            text = textract.process(pagefname, method='tesseract').decode('utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the single output file.
            textfile.write(text)
            os.remove(pagefname)  # clean up.
            print(text)
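
Under Python 3, textract.process returns bytes, so the OCR output has to be decoded before the str markers can be attached to it. The marker step on its own might look like the sketch below; mark_page and write_pages are hypothetical helper names, and UTF-8 is assumed as the encoding.

```python
def mark_page(raw, page_no, encoding="utf8"):
    """Decode OCR bytes and wrap them in begin/end page markers."""
    text = raw.decode(encoding, errors="replace")
    return "\n<begin page pos = {0}>\n{1}\n<end page pos = {0}>\n".format(page_no, text)


def write_pages(pages, out_path):
    """Write every marked page, in order, to a single text file."""
    with open(out_path, "w", encoding="utf8") as textfile:
        for page_no, raw in enumerate(pages):
            textfile.write(mark_page(raw, page_no))
```

Here pages is any iterable of per-page OCR results (bytes), so everything ends up in one file instead of one file per page.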
Carbonize answered 30/4, 2017 at 18:56 Comment(11)
Thanks Roland. In ---> 13 for fn in p.imap_unordered(extract_txt, file_path): I got: TypeError: write() argument must be str, not bytes – Nabors
@johndoe Then you must decode the bytes or write the file in binary mode. – Carbonize
After changing to wb this worked pretty well, and it actually finished with a large document. Do you think it is possible to add separators for the page position: <start/page = 1> ... page content ... <end/page = 1>? – Nabors
I tried to rewrite the snippet to write the prints to a file. However, I got: MissingFileError: The file "<_io.BufferedWriter name='page000.pdf'>" can not be found. Is this the right path/to/file/you/want/to/extract.pdf'>? How should I manage writing each page to the same file, instead of printing it? – Nabors
I tried and still cannot append the page position inside the outfname. – Nabors
My solution already writes the extracted text for each page to its own file. It would be trivial to add text to the OCR output (using the + operator), but why would you want to do that? The file name already includes the page number... – Carbonize
Thanks again Roland. I actually tried both + and join and I couldn't. Also, when I ran your solution and opened the output file, the page separator wasn't in the page. – Nabors
The bounty will expire, so I will award it to you. Nevertheless, could you fix the last part? Thanks! – Nabors
When I first ran it I got a TypeError: must be str, not bytes, in: ---> 16 text = '<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>'.format(i). Then I tried to fix it with str(). Then I noticed that if I pass a 2-page document, the script yields two files in the directory instead of a single document. How can I append everything to a single file instead of generating one file per page? – Nabors
Append the text from each page to a single file instead of writing it to separate files... – Carbonize
Thanks for the help, Roland! – Nabors
