How to use Amazon Textract with PDF files

Asked 25/11, 2019 at 18:46 Answered 27/9, 2023 at 13:29

amazon-web-services ocr text-extraction amazon-textract

I can already use Textract with JPEG files. I would like to use it with PDF files.

I have the code below:

import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]

        # print('\033[94m' +  item["Text"] + '\033[0m')
        # # print(item["Text"])

# removing the quotation marks from the string, otherwise would cause problems to A.I
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')
print(documentText)

As I said, it works fine. But I would like to pass it a PDF file, as the test web application allows.

I know it is possible to convert the PDF to JPEG in Python, but it would be nice to work with the PDF directly. I read the documentation and did not find the answer.

How can I do that?

EDIT 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket.

Jaclynjaco answered 25/11, 2019 at 18:46 Comment(0)

As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing this may be cheaper and easier than you think:

Create a new S3 bucket in the console and write down the bucket name, then:

import random
import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path+filename, filename)

client = boto3.client('textract')
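response = client.start_document_text_detection(
                   DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename} },
                   # ClientRequestToken must be a string, and randint needs integer bounds
                   ClientRequestToken=str(random.randint(1, 10**10)))

jobid = response['JobId']
# JobStatus stays IN_PROGRESS until the job finishes, so this call may need repeating
response = client.get_document_text_detection(JobId=jobid)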

It may take 5-50 seconds until the call to get_document_text_detection(...) returns a result. Until then, the response reports that the job is still processing (its JobStatus is IN_PROGRESS).
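
The waiting described above could be handled with a small polling loop. This is only a sketch: the function name wait_for_text_detection, the 5-second delay, and the 60-attempt cap are assumptions chosen for illustration, not AWS recommendations.

```python
import time


def wait_for_text_detection(client, job_id, delay=5.0, max_attempts=60):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    `client` is a boto3 Textract client (or any object with the same method).
    `delay` and `max_attempts` are illustrative defaults.
    """
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            return response
        if status != "IN_PROGRESS":
            # Any other status (for example FAILED) ends the wait with an error
            message = response.get("StatusMessage", "no message")
            raise RuntimeError(f"Textract job {job_id} ended with status {status}: {message}")
        time.sleep(delay)
    raise TimeoutError(f"Textract job {job_id} still running after {max_attempts} polls")
```

With the variables from the snippet above, the last line would become response = wait_for_text_detection(client, jobid).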

According to my understanding, each token results in exactly one paid API call; if the same token has been used before, the earlier job is retrieved instead of a new one being started.
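
If the aim is to reuse an earlier job for the same file, a random token defeats that, because every run produces a new one. A possible sketch is to derive the token from the S3 location. The helper name request_token is hypothetical, and the token format (at most 64 characters of letters, digits, hyphens, and underscores) is my reading of the API constraint.

```python
import hashlib


def request_token(bucket, key):
    """Derive a stable ClientRequestToken from the S3 bucket and object key.

    A SHA-256 hex digest is 64 characters of [0-9a-f], which is assumed here
    to satisfy the token's length and character constraints.
    """
    return hashlib.sha256(f"{bucket}/{key}".encode("utf-8")).hexdigest()
```

It would be passed as ClientRequestToken=request_token(bucket, filename). Note that the token only reflects the location, so a different file uploaded under the same name would map to the same token.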

Edit: I forgot to mention one intricacy: if the document is large, the result may need to be stitched together from multiple 'pages'. The kind of code you will need to add is:


...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
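
Once the paginated responses are collected, the text can be pulled out of them in the same way the question does for a single response. A minimal sketch, assuming pages is the list built above (the helper name lines_from_pages is made up for illustration):

```python
def lines_from_pages(pages):
    """Return the text of every LINE block across all result pages, in order."""
    return [
        block["Text"]
        for page in pages
        for block in page.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
```

For example, print("\n".join(lines_from_pages(pages))) prints one detected line per row.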

Palmar answered 7/8, 2020 at 8:56 Comment(1)

Thank you so much for the edit. I did not know about the NextToken thing and never came across it while implementing... this is what happens when you don't read the entire documentation :'( I had been googling for the past couple of days why Textract was not scanning my entire document when I was using boto3 :3 – Basenji 27/4, 2022 at 5:49

As mentioned on the AWS Textract FAQ page (https://aws.amazon.com/textract/faqs/), PDF files are supported, and they are supported in the SDK as well: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html

Sample usage: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/12-pdf-text.py

Dewain answered 25/11, 2019 at 19:1 Comment(1)

I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket. With the script you sent me, I would have to use an S3 bucket. Right? – Jaclynjaco 25/11, 2019 at 19:5

The easiest and most transparent way to process pdf files with Textract is to use the amazon-textract-textractor library. It calls the asynchronous function and creates a lazy-loaded document object that gets automatically filled when the asynchronous job completes.

from textractor import Textractor

extractor = Textractor(profile_name="default")
document = extractor.start_document_text_detection(
    "./multipage.pdf",
    s3_upload_path="s3://<YOUR BUCKET HERE>",
)

for page in document.pages:
    print(page.lines)

For example, with this multipage PDF you would get:

[INVOICE, 00000135, Invoice No:, 12 March 2020, Invoice Date:, - META -, F0016, Purchase Order:, LEGAL & FINANCE, Abstractors and Design Co., Attn: Ronald Davis, Suite 8, 611 Maine St, San Francisco CA 94105, Item, Amount, Qty, 3, $1,350.00, ACP101 Accounting Package, Annual Subscription to Premier Version with Tax,, Inventory and Payroll Plugins, 4.5, $495.00, ACP101T Online Training, Hours of Training in Premier Version - Interactive, Demos with Q&A Sessions, 10, $1,100.00, ACP101S Standard Support, Initial Hours allocated for access to email and phone, support for Premier Version, 6, ACP101C Screen Customization, $660.00, Hours spent customizing screens in Premier Version, for client requirements, 4.5, $495.00, ACP101R Report Customization, Hours spent customizing reports in Premier Version, for client requirements, 154-164 The Embarcadero, San Francisco, CA 94105, Tel: (1) 555-123-1234 Email: [email protected]]
[INVOICE, 00000135, Invoice No:, 12 March 2020, Invoice Date:, - META -, F0016, Purchase Order:, LEGAL & FINANCE, Item, Amount, Qty, 2, $220.00, ACP101I System Imports, Hours spent importing customer records into Premier, Version, 3, $900.00, ACP100 Accounting Package, Annual Subscription to Standard Version of Accounts, System, 4.5, $495.00, ACP100T Online Training, Hours of Training in Standard Version - Interactive, Demos with Q&A Sessions, Total:, $5,715.00, Payment Terms: 14 days, Payment Due By: Thursday, 26 March 2020, 154-164 The Embarcadero, San Francisco, CA 94105, Tel: (1) 555-123-1234 Email: [email protected]]

Tupungato answered 3/9, 2023 at 21:7 Comment(0)

It is easy and cheap to keep the PDF in an S3 bucket and use it with Textract. If the PDF has multiple pages, it is better to use the async function, which can publish to an SNS topic when the job completes; you then subscribe to that SNS topic. This is a good use case for a serverless app.
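
As a rough sketch of the subscriber side, a Lambda function subscribed to the SNS topic could read the job id and status from the notification. The function names below are hypothetical, and the code assumes the completion message is a JSON string with JobId and Status fields, delivered in the usual SNS-to-Lambda event shape.

```python
import json


def completed_jobs(event):
    """Yield (job_id, status) for each Textract notification in an SNS-triggered event."""
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        yield message["JobId"], message["Status"]


def handler(event, context):
    """Hypothetical Lambda entry point: return the ids of the jobs that succeeded."""
    return [job_id for job_id, status in completed_jobs(event) if status == "SUCCEEDED"]
```

Each returned job id can then be passed to get_document_text_detection to fetch the result. For the notification to be sent, the job has to be started with a NotificationChannel argument containing the SNSTopicArn and a RoleArn that Textract may use to publish to the topic.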

I found this video, which shows a step-by-step guide and has the full source code for reference.

Video: https://www.youtube.com/watch?v=BNnFfTZsmjc

Source code: https://github.com/CodeSam621/Demo/tree/main/TextractAsync

Decare answered 27/9, 2023 at 13:29 Comment(0)

Since you want to work with PDF files, you will be using the Amazon Textract asynchronous API (StartDocumentAnalysis, StartDocumentTextDetection), and it is currently not possible to pass PDF files in directly. This is because the Amazon Textract asynchronous APIs only accept the document location as an S3 object.

From AWS Textract doc:

Amazon Textract currently supports PNG, JPEG, and PDF formats. For synchronous APIs, you can submit images either as an S3 object or as a byte array. For asynchronous APIs, you can submit S3 objects.

Verruca answered 28/11, 2019 at 18:58 Comment(0)

Upload the PDF to an S3 bucket. After that, you can easily use the available function startDocumentAnalysis to fetch the PDF directly from S3 and run Textract on it.
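
In boto3 the corresponding call is start_document_analysis. A minimal sketch, assuming the PDF is already in the bucket; the helper name start_analysis and the default feature list are illustrative choices:

```python
def start_analysis(client, bucket, key, features=("TABLES", "FORMS")):
    """Start an asynchronous analysis job for an S3 object and return its JobId.

    `client` is a boto3 Textract client (or any object with the same method).
    """
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(features),
    )
    return response["JobId"]
```

The result is then fetched with client.get_document_analysis(JobId=...), which, like the text detection call, reports an in-progress status until the job finishes.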

Terribly answered 17/10, 2021 at 17:17 Comment(0)

It works (almost); I had to make ClientRequestToken a string instead of an integer.

Hubbard answered 28/6, 2022 at 19:18 Comment(0)

You can run Textract on a PDF file page by page without using an S3 bucket.

For this task you need to extract each page from the PDF as an image.

I used Ghostscript. Example command:

gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dPDFFitPage -dDEVICEWIDTHPOINTS=3000 -dDEVICEHEIGHTPOINTS=3000 -dFirstPage=1 -dLastPage=1 -sOutputFile=[outputFilePath] [inputFileName]

This is sample code in Go:

import (
    "fmt"
    "os/exec"
    "strings"
)

// GenerateImageByPDF takes a PDF file and a page number as input, then generates an image file of that page.
func GenerateImageByPDF(inputFileName string, outputFileName string, pageNumber int) error {
    dFirstPageFlag := fmt.Sprintf("-dFirstPage=%d", pageNumber)
    dLastPageFlag := fmt.Sprintf("-dLastPage=%d", pageNumber)
    outputFileFlag := fmt.Sprintf("-sOutputFile=%s", outputFileName)

    cmd := exec.Command("gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=jpeg", "-dJPEGQ=100", "-dPDFFitPage", "-dDEVICEWIDTHPOINTS=3000", "-dDEVICEHEIGHTPOINTS=3000", dFirstPageFlag, dLastPageFlag, outputFileFlag, inputFileName)

    // CombinedOutput captures stdout and stderr, so the page-range message is seen on either stream
    output, err := cmd.CombinedOutput()
    if err != nil {
        return err
    }

    if strings.Contains(string(output), "Requested FirstPage is greater than the number of pages in the file") {
        return fmt.Errorf("requested page is greater than the number of pages in file")
    }
    // TODO (later development): if the output size is more than 10MB we're going to have a problem with the Amazon Textract service, so you can use -dDownScaleFactor=2 to reduce the file size and repeat until it's less than 10MB
    return nil
}
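
Since the question uses Python, the same idea could be sketched there as well: render one page with the Ghostscript command shown above, then send the JPEG bytes to the synchronous detect_document_text call from the question. The function names are hypothetical, and the sketch assumes the gs executable is on the PATH.

```python
import subprocess


def gs_page_command(pdf_path, jpeg_path, page):
    """Build the Ghostscript command that renders one PDF page to a JPEG file."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=jpeg", "-dJPEGQ=100", "-dPDFFitPage",
        "-dDEVICEWIDTHPOINTS=3000", "-dDEVICEHEIGHTPOINTS=3000",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-sOutputFile={jpeg_path}", pdf_path,
    ]


def detect_page_text(client, pdf_path, jpeg_path, page):
    """Render one page with Ghostscript, then pass the JPEG bytes to Textract.

    `client` is a boto3 Textract client; no S3 bucket is involved.
    """
    subprocess.run(gs_page_command(pdf_path, jpeg_path, page), check=True)
    with open(jpeg_path, "rb") as image:
        return client.detect_document_text(Document={"Bytes": image.read()})
```

Calling detect_page_text once per page number and concatenating the LINE blocks gives the text of the whole document, at the cost of one synchronous call per page.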

Instauration answered 22/5, 2023 at 7:20 Comment(0)
