Azure Computer Vision API - OCR to Text on PDF files

Asked 28/9, 2018 at 15:47 Answered 17/9, 2020 at 12:17

I'm attempting to leverage the Computer Vision API to OCR a PDF file that is a scanned document but is treated as an image PDF.

I've tested it and it tells me that the PDF is "InvalidImageFormat", "Input data is not a valid image". When I test it on a PNG, it works perfectly.

Is there any way to use the API against a PDF image, or is there an Azure API that I could use in conjunction to go PDF > PNG > Text?

Tutelage answered 28/9, 2018 at 15:47 Comment(0)

Edit

Since I answered, additional services have become available. I have not personally tried some of them, but they may suit this purpose.

https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-intro

And, at some point in the future when it goes GA, Amazon Textract: https://aws.amazon.com/textract/

Original Answer

Unfortunately, Azure has no PDF integration for its Computer Vision API. To make use of Azure Computer Vision, you would need to convert the PDF to an image (JPG, PNG, BMP, GIF) yourself.

Google now offers PDF integration, and I have seen some really good results from it in my testing so far.

This is done through the asyncBatchAnnotateFiles method of the Vision client (I have been using the Node.js variant of the API).

It can handle files of up to 2000 pages. Results are divided into 20-page segments and output to Google Cloud Storage.

https://cloud.google.com/vision/docs/pdf
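
For illustration, here is a minimal sketch of the request body that the REST form of this call (POST https://vision.googleapis.com/v1/files:asyncBatchAnnotate) takes for one PDF, as far as I understand the documentation linked above. The helper names and the bucket URIs are hypothetical, the PDF is assumed to already be in Google Cloud Storage, and authentication is left out. The second helper just works out how many result files to expect for a given segment size.

```python
import json
import math


def build_pdf_ocr_request(gcs_input_uri, gcs_output_prefix, batch_size=20):
    """Build the JSON body for one PDF (hypothetical helper, not part of any SDK)."""
    return {
        "requests": [
            {
                # The PDF must already be uploaded to Cloud Storage.
                "inputConfig": {
                    "gcsSource": {"uri": gcs_input_uri},
                    "mimeType": "application/pdf",
                },
                # Dense-text OCR, suited to scanned documents.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Results are written as JSON files under this prefix,
                # batch_size pages per file.
                "outputConfig": {
                    "gcsDestination": {"uri": gcs_output_prefix},
                    "batchSize": batch_size,
                },
            }
        ]
    }


def expected_output_files(page_count, batch_size=20):
    """Number of result files to expect when pages are split into segments."""
    return math.ceil(page_count / batch_size)


if __name__ == "__main__":
    # Placeholder bucket names; replace with your own.
    body = build_pdf_ocr_request(
        "gs://my-bucket/scan.pdf", "gs://my-bucket/ocr-output/"
    )
    print(json.dumps(body, indent=2))
    print(expected_output_files(45))
```

The call is asynchronous, so the response only identifies a long-running operation. The OCR text itself has to be read from the output files in the bucket once the operation finishes.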

Schilling answered 30/10, 2018 at 14:56 Comment(1)

It seems that Azure can OCR pdf now: For PDF and TIFF files, up to 2000 pages (only first two pages for the free tier) are processed. learn.microsoft.com/en-us/azure/cognitive-services/… – Geralyngeraniaceous 14/4, 2021 at 15:27

The latest OCR service from Microsoft Azure is called Recognize Text, and it significantly outperforms the previous OCR engine. Recognize Text can now be used with Read, which reads and digitizes PDF documents of up to 200 pages.

Chet answered 15/3, 2019 at 19:54 Comment(2)

Indeed, awesome news! multi-page TIFFs (i.e. faxes) too – Volnak 8/4, 2019 at 10:38

Recognize Text has now been deprecated. Read supersedes it and adds considerable functionality: Upgrade guide Read API spec – Supersaturate 17/11, 2020 at 0:6

There is a new Cognitive Services API called Azure Form Recognizer (in preview as of November 2019) that should do the job:

https://azure.microsoft.com/en-gb/services/cognitive-services/form-recognizer/

It can process the file formats you wanted:

Format must be JPG, PNG, or PDF (text or scanned). Text-embedded PDFs are best because there's no possibility of error in character extraction and location.

https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/overview

Here is the link to the official Form Recognizer API documentation:

https://westus2.dev.cognitive.microsoft.com/docs/services/form-recognizer-api/operations/AnalyzeWithCustomModel

Note:

Form Recognizer is currently available in English, with additional language availability growing (4.12.2019)
Form Recognizer is available in the following Azure regions (4.12.2019): Canada Central, North Europe, West Europe, UK South, Central US, East US, East US 2, South Central US, West US https://azure.microsoft.com/en-in/global-infrastructure/services/?products=cognitive-services

Citron answered 11/11, 2019 at 8:53 Comment(1)

Some of the limitations for PDFs: the number of pages should be less than 50, radio buttons and check boxes are not supported, and complex tables are not supported. – Veliger 4/12, 2019 at 4:39

Sorry, you have to break the PDF pages into images (JPG or PNG) and then send the images over to Computer Vision. Breaking it down is also a good idea because you don't have to OCR all pages, only the ones that matter.

Tonguelash answered 26/11, 2018 at 7:13 Comment(0)

There is a new Read API that works with PDFs: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text

Computer Vision’s Read API is Microsoft’s latest OCR technology that extracts
printed text (seven languages), handwritten text (English only), digits, and 
currency symbols from images and multi-page PDF documents.

Read API reference: https://westcentralus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-ga/operations/5d986960601faab4bf452005

It works well enough, but does not have a lot of languages yet.
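
To give an idea of the flow, here is a rough sketch using only the Python standard library. Read is asynchronous: you POST the file, get an Operation-Location header back, and poll that URL until the status is "succeeded" or "failed". The v3.0 path, the header names and the response field names are my understanding of the API reference above, so check them against the version you use. The function names, endpoint and key are placeholders.

```python
import json
import time
import urllib.request

# Assumed version path; newer API versions use a different segment.
READ_PATH = "/vision/v3.0/read/analyze"


def submit_document(endpoint, key, data):
    """POST raw PDF or image bytes and return the URL to poll for the result."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + READ_PATH,
        data=data,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def wait_for_result(operation_url, key, delay=1.0, max_polls=60):
    """Poll the operation URL until the service reports a final status."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_polls):
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(delay)
    raise TimeoutError("Read operation did not finish in time")


def extract_lines(result):
    """Flatten the per-page results into a list of text lines."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

Typical use would be to read the PDF with open(path, "rb"), pass the bytes to submit_document, and then call extract_lines(wait_for_result(...)) on the returned URL.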

Faintheart answered 23/7, 2020 at 15:13 Comment(0)

You can convert each page of the PDF to an image using fitz (PyMuPDF).

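# import packages
import fitz  # PyMuPDF
import numpy as np
import cv2

# set path to pdf (placeholder, replace with your own file)
path2doc = "<path to pdf>"

# open pdf with fitz
doc = fitz.open(path2doc)

# determine number of pages (page_count replaces the older pageCount)
pagecount = doc.page_count

# loop over all pages and convert each one to an image
# (PNG here, since not every PyMuPDF version can encode JPEG)
images = []
for i in range(pagecount):
    page = doc[i]
    # render the page and encode it as PNG bytes
    png_bytes = page.get_pixmap().tobytes(output="png")
    # decode the bytes into a BGR numpy array for OpenCV
    png_as_np = np.frombuffer(png_bytes, dtype=np.uint8)
    images.append(cv2.imdecode(png_as_np, cv2.IMREAD_COLOR))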

Once this is done, you can send them to the API.
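
As a rough sketch of that last step: the API expects encoded image bytes, not a numpy array, so you can send the encoded bytes produced for each page directly (or re-encode an array with cv2.imencode first). The example below assumes the synchronous OCR operation at /vision/v3.0/ocr and the regions/lines/words response layout as I understand them from the Computer Vision reference. The endpoint, key and function names are placeholders.

```python
import json
import urllib.request

# Assumed path of the synchronous OCR operation for single images.
OCR_PATH = "/vision/v3.0/ocr"


def ocr_image(endpoint, key, image_bytes):
    """POST one encoded image (PNG/JPEG bytes) and return the parsed JSON."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + OCR_PATH,
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def ocr_text(response):
    """Join the recognized words of each line into one string per line."""
    lines = []
    for region in response.get("regions", []):
        for line in region.get("lines", []):
            words = [word["text"] for word in line.get("words", [])]
            lines.append(" ".join(words))
    return lines
```

Calling ocr_text(ocr_image(endpoint, key, page_bytes)) for each page then gives you the text page by page, which also makes it easy to skip pages you do not care about.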

Kaleighkalends answered 17/9, 2020 at 12:17 Comment(0)
