Unsupported document format while using Amazon Textract
I am using Amazon Textract with boto3. When I try to parse a PDF file stored in Amazon S3, I get the error "Request has unsupported document format". I am fairly new to this, but the Textract documentation says that PDF files are supported.

This is the code I am using:

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']

This gives me the error: Request has unsupported document format.

Casemate answered 18/7, 2019 at 7:8 Comment(0)
38

detect_document_text() is a synchronous API that only supports PNG or JPG images.

If you'd like to process PDF files, you should use the asynchronous API called start_document_text_detection().

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
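
As a rough illustration of the asynchronous route, the sketch below starts a job with start_document_text_detection(), polls get_document_text_detection() until the job leaves the IN_PROGRESS state, and then follows NextToken to collect every page of results. The helper name detect_pdf_text and the poll_seconds and max_polls parameters are made up for this example, and simple polling is only one option (Textract can also notify an SNS topic when a job finishes).

```python
import time


def detect_pdf_text(client, bucket, key, poll_seconds=5, max_polls=120):
    """Run asynchronous text detection on a PDF in S3 and return all blocks.

    `client` is expected to be a boto3 Textract client. The function name and
    the polling parameters are illustrative, not part of boto3.
    """
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS, or give up after max_polls.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results are paginated: keep requesting pages while a NextToken is returned.
    blocks = list(result.get("Blocks", []))
    while result.get("NextToken"):
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("textract", region_name="us-east-1")
#   blocks = detect_pdf_text(client, "bucketName", "filename.pdf")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, and the returned list has the same shape as response['Blocks'] in the question.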

Gimble answered 19/7, 2019 at 0:2 Comment(3)
If the document is only one page long, there is no problem using detect_document_text(). Do you know how to make it process only one page when the PDF has several pages? - Risarise
"detect_document_text() is a synchronous API that only supports PNG or JPG images." The API docs say it supports PNG, JPG, PDF and TIFF, so either this answer is not correct or the supported formats changed after it was written. It's more likely that multi-page PDF documents are rejected by the API. - Baggywrinkle
This post can be improved to include single-page PDF support. - Autorotation
4

The Textract synchronous APIs have supported single-page PDFs for a while now.

So you can either pre-split your document into single pages and use the sync API, or use the async API if you want to pass the multi-page file directly.

Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/start_document_text_detection.html
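
One way to combine the two routes, sketched here under some assumptions, is to try the synchronous call first and switch to an asynchronous helper only when the document is rejected. The sketch assumes a multi-page PDF is rejected with UnsupportedDocumentException (the exception type that matches the error message in the question); the names detect_text_with_fallback and async_detect are hypothetical, and async_detect stands for any callable you supply that wraps start_document_text_detection().

```python
def detect_text_with_fallback(client, bucket, key, async_detect):
    """Try the synchronous API first, then fall back to an async helper.

    `client` is expected to be a boto3 Textract client. `async_detect` is a
    caller-supplied callable (hypothetical here) taking (client, bucket, key)
    and returning a list of blocks from the asynchronous API.
    """
    document = {"S3Object": {"Bucket": bucket, "Name": key}}
    try:
        # Works for PNG, JPG and, per the answer above, single-page PDFs.
        return client.detect_document_text(Document=document)["Blocks"]
    except client.exceptions.UnsupportedDocumentException:
        # Assumed to be raised for multi-page PDFs: hand over to the async path.
        return async_detect(client, bucket, key)
```

This avoids splitting the PDF yourself, at the cost of one rejected request for each multi-page document.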

Autorotation answered 24/6, 2023 at 20:30 Comment(0)