Using Textract for OCR locally

I want to extract text from images using Python. (The Tesseract library does not work for me because it requires installation.)

I found the boto3 library and Amazon Textract, but I'm having trouble getting them to work. I'm still new to this. Can you tell me what I need to do to run my script correctly?

This is my code:

import cv2
import boto3
import textract


#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract',region_name='us-west-2')

response = textract.detect_document_text(Document={'Bytes': img})  # gives me error
print(response)

When I run this code, I get:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I have also tried this:

# Document
documentName = "slika2.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract',region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes}) #ERROR

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

But I get this error:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I'm new to this, so any help would be appreciated. How can I read text from my image or PDF file?

I have also added this block of code, but I still get an error: Unable to locate credentials.

session = boto3.Session(
    aws_access_key_id='xxxxxxxxxxxx',
    aws_secret_access_key='yyyyyyyyyyyyyyyyyyyyy'
)
Palmitin answered 24/9, 2020 at 10:57 Comment(8)
See #33297672, it may help you. As far as I can see, you haven't set up an AWS profile. - Tanager
Any help with this: #64101724? - Palmitin
@aviboy2006 Can you tell me what I should add to my code once I have set up the AWS profile? - Palmitin
If you have set up a profile, then check my first answer. - Tanager
@aviboy2006 Sorry, but that does not help me. I'm still learning about AWS and Textract. I want to be able to read text from a PDF or image file. I have the code that I wrote above, so if you can, tell me exactly what I need to do: what I should add to my code, what I should remove, etc. - Palmitin
Maybe let's start from the beginning. Do you have an AWS account? If yes, how do you access it? Have you set up the AWS CLI as shown here? Do you have programmatic keys to access your account? - Susie
Yes, I have installed awscli on my Mac and set my region, access key, and secret access key, but when I run the program I get an error saying that my keys are not valid. - Palmitin
Try this: github.com/aviboy2006/coding-challenge/blob/master/… - Tanager

The problem is in how the credentials reach boto3. One way to fix it is to pass the credentials directly when creating the boto3 client.

import boto3

# boto3 client
client = boto3.client(
    'textract', 
    region_name='us-west-2', 
    aws_access_key_id='xxxxxxx', 
    aws_secret_access_key='xxxxxxx'
)

# Read image
with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())

# Call Amazon Textract
response = client.detect_document_text(
    Document={'Bytes': img}
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
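
The printing loop above can also be wrapped in a small helper, which makes the parsing step easy to check without calling AWS. This is only a sketch: `extract_lines` and `sample_response` are hypothetical names, and the sample data is hand-written in the shape of a Textract response, not real service output.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-written fragment in the shape of a Textract response
# (illustrative data only, not real service output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
    ]
}

for line in extract_lines(sample_response):
    print(line)
```

With a real call, `extract_lines(response)` would take the place of the loop.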

Note that hardcoding AWS keys in code is not recommended. Please refer to the following document:

https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html
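
As a rough sketch of the non-hardcoded route, the client can be built from a `boto3.Session`, which reads credentials from the environment or from a profile configured with the AWS CLI. Note that a `Session` only has an effect if the client is created from it; creating a session and then calling `boto3.client(...)` separately ignores that session. The helper names `missing_credential_vars` and `make_textract_client` are hypothetical, and the environment check only applies if you rely on the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variables rather than a profile.

```python
import os


def missing_credential_vars(env=None):
    """Return the standard AWS credential variables that are unset or blank."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return [name for name in required if not env.get(name, "").strip()]


def make_textract_client(profile_name=None, region_name="us-west-2"):
    """Create a Textract client without hardcoded keys.

    With profile_name=None, boto3 falls back to its default credential
    lookup (environment variables, then the shared credentials file).
    """
    import boto3  # imported here so the check above works without boto3

    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    # The client must come from the session, otherwise the session is unused.
    return session.client("textract")
```

Calling `make_textract_client()` and then `detect_document_text(Document={'Bytes': img})` on the result would replace the client creation shown in the answer.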

Haematocele answered 8/10, 2020 at 3:22 Comment(4)
I haven't tested this with PDF; please try it and let me know if there is any issue. :) - Haematocele
It's giving an error. I do not know if I can do it without an S3 bucket. - Palmitin
Please check this question: #64261511 - Palmitin
Yes, you are right. For PDF, you have to use the asynchronous method with S3. A workaround is to convert the PDF to images and then use Textract. Let me know if you need an example of that. - Haematocele
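
For reference, the asynchronous S3 route mentioned in the last comment might look roughly like the sketch below. It assumes the PDF has already been uploaded to an S3 bucket that the credentials can read, and it uses the boto3 Textract calls `start_document_text_detection` and `get_document_text_detection`. The function name `detect_pdf_lines` and its polling parameters are hypothetical, and simple polling is used here for brevity.

```python
import time


def detect_pdf_lines(client, bucket, key, poll_seconds=5, max_polls=60):
    """Run asynchronous Textract text detection on a PDF stored in S3."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Anything other than SUCCEEDED (including partial success) is
    # treated as an error in this sketch.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines.extend(
            b["Text"] for b in result.get("Blocks", []) if b.get("BlockType") == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            return lines
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

A call could look like `detect_pdf_lines(boto3.client('textract', region_name='us-west-2'), 'my-bucket', 'slika2.pdf')`, where `my-bucket` is a placeholder bucket name.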
