Using AWS Textract for processing PDF

I want to use the Textract OCR service to read text from a PDF file. The problem is that I want to do it locally, without an S3 bucket. I tested it with image files and it works well, but it does not work for PDF files.

This is the code where I get an error:

response = textract.start_document_text_detection(DocumentLocation="sample2.pdf")

Error:

Invalid type for parameter DocumentLocation, value: sample2.pdf, type: <class 'str'>, valid types: <class 'dict'>

Code2:

response = textract.start_document_text_detection(DocumentLocation={"name":"sample2.pdf"})

Error:

Unknown parameter in DocumentLocation: "name", must be one of: S3Object

Code3:

response = textract.start_document_text_detection(Document={'Bytes': "sample2.pdf"})

Error:

Unknown parameter in input: "Document", must be one of: DocumentLocation, ClientRequestToken, JobTag, NotificationChannel, OutputConfig

What should I do? Is there a way to make Textract work for PDF documents without S3?

Copestone answered 8/10, 2020 at 10:52 Comment(1)
I was searching for the same thing. Yes, you can use AWS Textract on local files, but you have to pass the files (.pdf, .jpg, ...) as bytes: docs.aws.amazon.com/textract/latest/dg/API_Document.html Sellers
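To illustrate the comment above: the Document parameter with a Bytes key belongs to the synchronous detect_document_text operation, not to start_document_text_detection, and Bytes must hold the file's contents rather than its name. The following is a minimal sketch under those assumptions. The helper names read_document_bytes and detect_lines_from_file are made up for this example, and textract is assumed to be a client created with boto3.client("textract"). Whether a PDF is accepted by the synchronous call depends on the format and page limits Textract enforces, so check the current limits before relying on it.

```python
def read_document_bytes(path):
    """Read a local file and return its raw contents as bytes."""
    with open(path, "rb") as handle:
        return handle.read()


def detect_lines_from_file(textract, path):
    """Send a local file to the synchronous API and return the detected lines.

    `textract` is assumed to be a client such as boto3.client("textract").
    """
    response = textract.detect_document_text(
        Document={"Bytes": read_document_bytes(path)}
    )
    # Keep only LINE blocks; PAGE and WORD blocks are skipped.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
```

A call would look like detect_lines_from_file(boto3.client("textract"), "sample2.pdf"), which needs valid AWS credentials and a configured region.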

The short answer to your question is "No."

The start_document_text_detection call works with S3 only for input. You will need to follow the expected input format, which is described in the boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection

Essentially, the service wants a structured input and you need to fill that in correctly according to their specification. Here's the DocumentLocation dictionary input expected by boto3.

DocumentLocation={
    'S3Object': {
        'Bucket': 'string',
        'Name': 'string',
        'Version': 'string'
    }
}
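As a rough sketch of how that dictionary is used, the snippet below starts the asynchronous job for a PDF that has already been uploaded to S3, polls get_document_text_detection until the job finishes, and collects the LINE blocks across result pages. The function names s3_document_location and detect_pdf_lines, the polling interval, and the bucket and key values are assumptions for illustration. textract is assumed to be a client created with boto3.client("textract"). The optional Version key is omitted.

```python
import time


def s3_document_location(bucket, key):
    """Build the DocumentLocation dict; the optional 'Version' key is omitted."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def detect_pdf_lines(textract, bucket, key, poll_seconds=5):
    """Run an asynchronous text-detection job and return the detected lines.

    `textract` is assumed to be a client such as boto3.client("textract").
    """
    job = textract.start_document_text_detection(
        DocumentLocation=s3_document_location(bucket, key)
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        page = textract.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    # This sketch treats anything other than SUCCEEDED as an error.
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines += [b["Text"] for b in page["Blocks"] if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines
        page = textract.get_document_text_detection(JobId=job_id, NextToken=token)
```

For example, detect_pdf_lines(boto3.client("textract"), "my-bucket", "sample2.pdf") would be the call once sample2.pdf has been uploaded to a bucket named my-bucket (a placeholder name). Polling is the simplest approach; the NotificationChannel parameter listed in the error message above is the alternative for being notified when the job completes.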

I'm having some similar issues getting this to work in boto3 as well, but I will keep working through the docs to see what I can figure out.

Situla answered 23/10, 2020 at 15:40 Comment(0)
