Get lines and paragraphs, not symbols, from Google Vision API OCR on PDF

I am attempting to use the now supported PDF/TIFF Document Text Detection from the Google Cloud Vision API. Using their example code I am able to submit a PDF and receive back a JSON object with the extracted text. My issue is that the JSON file that is saved to GCS only contains bounding boxes and text for "symbols", i.e. each character in each word. This makes the JSON object quite unwieldy and very difficult to use. I'd like to be able to get the text and bounding boxes for "LINES", "PARAGRAPHS" and "BLOCKS", but I can't seem to find a way to do it via the AsyncAnnotateFileRequest() method.

The sample code is as follows:

import re

from google.cloud import storage
from google.cloud import vision
from google.protobuf import json_format


def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name=bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json_format.Parse(
        json_string, vision.types.AnnotateFileResponse())

    # The actual response for the first page of the input file.
    first_page_response = response.responses[0]
    annotation = first_page_response.full_text_annotation

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print(u'Full text:\n{}'.format(
        annotation.text))
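
For reference, the per-character detail sits at the bottom of the annotation hierarchy (pages > blocks > paragraphs > words > symbols). A minimal illustrative sketch, reusing the annotation variable obtained in the sample above, that walks down to the symbol level the question is complaining about:

for page in annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                for symbol in word.symbols:
                    # Each symbol is a single character with its own bounding box.
                    print(symbol.text, symbol.bounding_box)
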
Slowwitted answered 22/8, 2018 at 17:46 Comment(1)
#42391509 – Barrettbarrette

Unfortunately, when using the DOCUMENT_TEXT_DETECTION type, you can only get the full text per page or the individual symbols. It's not too difficult to put the paragraphs and lines back together from the symbols, though; something like this should work (extending your example):

breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
paragraphs = []
lines = []

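# Walk the annotation hierarchy (pages > blocks > paragraphs > words > symbols)
# and rebuild line and paragraph strings from each symbol's detected break type.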
for page in annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            para = ""
            line = ""
            for word in paragraph.words:
                for symbol in word.symbols:
                    line += symbol.text
                    if symbol.property.detected_break.type == breaks.SPACE:
                        line += ' '
                    if symbol.property.detected_break.type == breaks.EOL_SURE_SPACE:
                        line += ' '
                        lines.append(line)
                        para += line
                        line = ''
                    if symbol.property.detected_break.type == breaks.LINE_BREAK:
                        lines.append(line)
                        para += line
                        line = ''
            paragraphs.append(para)

print(paragraphs)
print(lines)
Inversion answered 29/8, 2018 at 21:27 Comment(7)
This solution does the same thing as the annotation.text property, which is already built in. – Hippie
No, it doesn't: the original question was already using annotation.text, but that has exactly the problem they were asking about: it doesn't break the response up into lines and paragraphs. This solution does. – Inversion
On my end, I'm getting the same results from annotation.text and from your code. Don't get me wrong, I like the break-type filtering, which is why I upvoted this answer, but it doesn't improve my output. – Hippie
Yes, the results will be the same; the question is about the structure of the results. – Inversion
One thing I've found about this code is that symbol.property doesn't always exist, which triggers an AttributeError. So I wrapped the if symbol.property... lines in a try/except AttributeError block and ignore the error with pass. – Austerity
I understand that; I can get the coordinates of paragraphs. Can I get the coordinates of lines? – Deathwatch
You can get bounding polys (boxes, actually) of words and symbols, but not lines. The vertices are page coordinates, but always arranged top-left, top-right, bottom-right, bottom-left in the local orientation of the symbol/word/paragraph. In my experience, when you have a mixture of text orientations, the association of words to paragraphs is somewhat random, and paragraphs will be visually interleaved and overlapping. Messy to decipher... (see the sketch after these comments). – Southey
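
Building on the last two comments, here is a minimal sketch (against the same pre-2.0 google-cloud-vision client used in the answer, reusing the annotation variable from the question) that guards against a missing symbol.property and approximates a bounding box for each reconstructed line from the word-level bounding boxes. The helper names break_type, word_vertices, box_for_words and lines_with_boxes are illustrative, not part of the API, and PDF responses may populate normalized_vertices (page fractions in 0..1) instead of pixel vertices.

from google.cloud import vision

breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType


def break_type(symbol):
    # Guard against symbol.property being absent (the AttributeError mentioned above).
    try:
        return symbol.property.detected_break.type
    except AttributeError:
        return None


def word_vertices(word):
    # PDF responses may carry normalized_vertices instead of pixel vertices.
    box = word.bounding_box
    return list(box.vertices) or list(box.normalized_vertices)


def box_for_words(words):
    # Axis-aligned box covering all vertices of the given words.
    xs = [v.x for w in words for v in word_vertices(w)]
    ys = [v.y for w in words for v in word_vertices(w)]
    return min(xs), min(ys), max(xs), max(ys)


def lines_with_boxes(annotation):
    # Yield (line_text, (x_min, y_min, x_max, y_max)) for each detected line.
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                text, words = '', []
                for word in paragraph.words:
                    words.append(word)
                    for symbol in word.symbols:
                        text += symbol.text
                        bt = break_type(symbol)
                        if bt == breaks.SPACE:
                            text += ' '
                        elif bt in (breaks.EOL_SURE_SPACE, breaks.LINE_BREAK):
                            yield text, box_for_words(words)
                            text, words = '', []
                if text:
                    yield text, box_for_words(words)


for line_text, line_box in lines_with_boxes(annotation):
    print(line_text, line_box)

The result is only an axis-aligned approximation: for rotated or mixed-orientation text you would need to merge the four vertices per word with the orientation in mind, as the last comment warns.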
