Why does the GCP Vision API return worse results in Python than in its online demo?

I wrote a basic Python script to call the GCP Vision API. My aim is to send it an image of a product and retrieve (with OCR) the words printed on the box. I have a predefined list of brands, so I can search the text returned by the API for a brand name and identify the product.

My python script is the following:

import io
from google.cloud import vision
from google.cloud.vision import types
import os
import cv2
import numpy as np

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "**************************"


def detect_text(file):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()

    with io.open(file, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    response = client.text_detection(image=image)
    texts = response.text_annotations
    print('Texts:')

    for text in texts:
        print('\n"{}"'.format(text.description))

        vertices = (['({},{})'.format(vertex.x, vertex.y)
                    for vertex in text.bounding_poly.vertices])

        print('bounds: {}'.format(','.join(vertices)))


file_name = "Image.jpg"
img = cv2.imread(file_name)

detect_text(file_name)

For now, I am experimenting with a single product image (951 × 335 resolution).

Its brand is Acuvue.

The problem is the following. When I test the online demo of the GCP Cloud Vision API, I get the following text result for this image:

FOR ASTIGMATISM 1-DAY ACUVUE MOIST WITH LACREON™ 30 Lenses BRAND CONTACT LENSES UV BLOCKING

(The JSON result contains all of the words above, including the word Acuvue, which is the one that matters to me, but the JSON is too long to post here.)

So the online demo detects the text on the product quite well, and in particular it accurately detects the word Acuvue (the brand). However, when I call the same API from my Python script with the same image, I get the following result:

Texts:

"1.DAY
FOR ASTIGMATISM
WITH
LACREONTM
MOIS
30 Lenses
BRAND CONTACT LENSES
UV BLOCKING
"
bounds: (221,101),(887,101),(887,284),(221,284)

"1.DAY"
bounds: (221,101),(312,101),(312,125),(221,125)

"FOR"
bounds: (622,107),(657,107),(657,119),(622,119)

"ASTIGMATISM"
bounds: (664,107),(788,107),(788,119),(664,119)

"WITH"
bounds: (614,136),(647,136),(647,145),(614,145)

"LACREONTM"
bounds: (600,151),(711,146),(712,161),(601,166)

"MOIS"
bounds: (378,162),(525,153),(528,200),(381,209)

"30"
bounds: (614,177),(629,178),(629,188),(614,187)

"Lenses"
bounds: (634,178),(677,180),(677,189),(634,187)

"BRAND"
bounds: (361,210),(418,210),(418,218),(361,218)

"CONTACT"
bounds: (427,209),(505,209),(505,218),(427,218)

"LENSES"
bounds: (514,209),(576,209),(576,218),(514,218)

"UV"
bounds: (805,274),(823,274),(823,284),(805,284)

"BLOCKING"
bounds: (827,276),(887,276),(887,284),(827,284)

But this does not detect the word "Acuvue" at all, whereas the demo does!

Why is this happening?

Can I fix something in my python script to make it work properly?

Bryantbryanty answered 1/5, 2018 at 13:38 Comment(5)
Does the result change (in any meaningful way) when using a DOCUMENT_TEXT_DETECTION request instead of a TEXT_DETECTION request? (example) - Plautus
Thank you for your comment. Do you mean replacing detect_text(file_name) with document_detect_text(file_name)? This gives me the following error: name 'document_detect_text' is not defined - Bryantbryanty
Both the response = ... line and the texts = ... line should be changed to use the document_text_detection method and the full_text_annotation attribute, as shown in the example I linked. My hope is that the more robust detector will find "Acuvue", but at a confidence that the standard detector considers too low to include. - Plautus
That's pretty good! Now it returns a long JSON which at the end has the following: text: "FOR ASTIGMATISM\n1-DAY ACUVUE\nMOIST\nWITH\nLACREON\342\204\242\n30 Lenses\nBRAND CONTACT LENSES\nUV BLOCKING\n". However, I am wondering what the numbers \342\204\242 in it mean. - Bryantbryanty
So you can write up an answer, preferably explaining what was wrong with what I was doing, and then I will tick it as correct if nothing changes. - Bryantbryanty

From the docs:

The Vision API can detect and extract text from images. There are two annotation features that support OCR:

  • TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.

  • DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. The JSON includes page, block, paragraph, word, and break information.

My hope was that the web API was actually using the latter, and then filtering the results based on the confidence.

A DOCUMENT_TEXT_DETECTION response includes additional layout information, such as page, block, paragraph, word, and break information, along with confidence scores for each.
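
To make that hierarchy concrete, here is a minimal sketch of walking a full_text_annotation down to individual words and their confidence scores. The function name words_with_confidence is my own, and fake_annotation is a hypothetical stand-in built by hand (with made-up confidence values) so the snippet runs without credentials; with a real response you would pass response.full_text_annotation instead.

```python
from types import SimpleNamespace


def words_with_confidence(full_text_annotation):
    """Yield (word_text, confidence) for every word in the annotation."""
    for page in full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    # A word is stored as a list of symbols (characters).
                    text = "".join(symbol.text for symbol in word.symbols)
                    yield text, word.confidence


def _fake_word(text, confidence):
    """Hypothetical helper: build an object shaped like an API word."""
    symbols = [SimpleNamespace(text=char) for char in text]
    return SimpleNamespace(symbols=symbols, confidence=confidence)


# Hand-built stand-in for response.full_text_annotation.
# The confidence values are invented for illustration only.
fake_annotation = SimpleNamespace(pages=[SimpleNamespace(blocks=[
    SimpleNamespace(paragraphs=[SimpleNamespace(words=[
        _fake_word("1-DAY", 0.98),
        _fake_word("ACUVUE", 0.61),
    ])])
])])

for text, confidence in words_with_confidence(fake_annotation):
    print("{}: {:.2f}".format(text, confidence))
```

Having the per-word confidence available lets you apply your own threshold rather than relying on whatever cut-off the simpler detector uses.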

At any rate, I was hoping (and my experience has been) that the latter method would "try harder" to find all the strings.

I don't think you were doing anything "wrong". There are just two parallel detection methods. One (DOCUMENT_TEXT_DETECTION) is more intense, optimized for documents (likely for straightened, aligned and evenly spaced lines), and gives more information that might be unnecessary for some applications.

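So I suggested you modify your code following the Python example in the docs: call document_text_detection instead of text_detection, and read full_text_annotation instead of text_annotations.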
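
As a rough sketch of that change, the function below swaps in document_text_detection and full_text_annotation, and adds a small brand lookup along the lines you described. The names detect_document_text and find_brand, and the brand list in the demo, are my own inventions. I pass the image as a plain dict, which I am assuming your library version accepts; if it does not, build the image with types.Image(content=content) as in your script.

```python
def detect_document_text(path, client=None):
    """Return the full OCR text of the image at `path`."""
    if client is None:
        # Imported lazily so find_brand below can be used on its own.
        from google.cloud import vision
        client = vision.ImageAnnotatorClient()

    with open(path, "rb") as image_file:
        content = image_file.read()

    # Assumption: the client accepts a dict for the image argument.
    response = client.document_text_detection(image={"content": content})
    return response.full_text_annotation.text


def find_brand(ocr_text, brands):
    """Return the first brand found in the text (case-insensitive), or None."""
    haystack = ocr_text.casefold()
    for brand in brands:
        if brand.casefold() in haystack:
            return brand
    return None


# Demo on the text reported in the comments, so no API call is needed here.
# "ExampleBrand" is a placeholder entry, not a real item from your list.
sample_text = "FOR ASTIGMATISM\n1-DAY ACUVUE\nMOIST\nWITH\nLACREON\u2122\n30 Lenses\n"
print(find_brand(sample_text, ["ExampleBrand", "Acuvue"]))
```

With real credentials set up, you would call find_brand(detect_document_text("Image.jpg"), your_brand_list).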

Lastly, the \342\204\242 you ask about are escaped octal values for the three bytes (0xE2 0x84 0xA2) that make up the UTF-8 encoding of the ™ symbol.

If you use the following snippet:

b = b"\342\204\242"
s = b.decode('utf8')
print(s)

You'll be happy to see that it prints ™.
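
Going the other way shows where the numbers come from. This is a small sketch with a helper name of my own (unescape_octal_utf8). My assumption is that you only see these escapes when the whole response message is printed, and that the text attribute itself already holds the real ™ character, so the helper matters only if you end up with the escaped form as a literal string.

```python
trademark = "\u2122"  # the ™ character

# Encode to UTF-8 and show each byte in octal.
utf8_bytes = trademark.encode("utf-8")
print([format(byte, "o") for byte in utf8_bytes])


def unescape_octal_utf8(escaped):
    """Turn text containing literal \\ooo escapes of UTF-8 bytes into a str.

    Sketch only: assumes the input contains nothing outside Latin-1.
    """
    # Step 1: interpret the backslash escapes, giving one char per byte.
    one_char_per_byte = escaped.encode("latin-1").decode("unicode_escape")
    # Step 2: map those chars back to raw bytes and decode them as UTF-8.
    return one_char_per_byte.encode("latin-1").decode("utf-8")


print(unescape_octal_utf8(r"LACREON\342\204\242"))
```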

Plautus answered 1/5, 2018 at 15:17 Comment(0)
