Text extraction - line-by-line

Asked 22/2, 2017 at 12:6 Answered 17/3, 2022 at 18:32

I am using the Google Vision API, primarily to extract text. It works fine, but in specific cases I need the API to scan an entire line and output its text before moving to the next line. However, the API appears to scan top to bottom on the left side, then move to the right side and scan top to bottom again. I would have liked the API to read left to right, then move down, and so on.

For example, consider the image:

The API returns the text like this:

“ Name DOB Gender: Lives In John Doe 01-Jan-1970 LA ”

Whereas, I would have expected something like this:

“ Name: John Doe DOB: 01-Jan-1970 Gender: M Lives In: LA ”

Is there a way to define a block size or margin setting (?) so that the image is read line by line?

Thanks for your help. Alex

Tran answered 22/2, 2017 at 12:6 Comment(0)

This might be a late answer but adding it for future reference. You can add feature hints to your JSON request to get the desired results.

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://i.sstatic.net/TRTXo.png"
        }
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ]
    }
  ]
}
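
For reference, the same request body can be assembled in Python before sending it to the REST `images:annotate` method. This is only a sketch: `build_request` is a hypothetical helper name, and authentication and the HTTP call itself are left out.

```python
import json


def build_request(image_uri, feature_type="DOCUMENT_TEXT_DETECTION"):
    """Return the JSON body for a single-image annotate request."""
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [{"type": feature_type}],
            }
        ]
    }


body = build_request("https://i.sstatic.net/TRTXo.png")
print(json.dumps(body, indent=2))
```

Swapping `feature_type` for `"TEXT_DETECTION"` gives the default OCR request, which makes it easy to compare the two outputs on the same image.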

For text that is very far apart, DOCUMENT_TEXT_DETECTION does not provide proper line segmentation either.

The following code does simple line segmentation based on the character polygon coordinates.

https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
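
The linked repository is JavaScript. As a rough Python illustration of the geometric test it relies on (checking whether a point lies inside a polygon), here is a minimal ray-casting sketch. It is not a port of the repository; `point_in_polygon` and the sample coordinates are made up for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside polygon [(x, y), ...]."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# A word box in the vertex order used by boundingPoly (illustrative values)
word_box = [(87, 148), (411, 148), (411, 206), (87, 206)]
print(point_in_polygon((200, 170), word_box))  # True
print(point_in_polygon((500, 170), word_box))  # False
```

One way to use such a test for line segmentation is to stretch a word's box horizontally across the page and check which other words have their centre inside it.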

Suricate answered 16/1, 2018 at 10:24 Comment(5)

I saw this code and it is very short to read, but I want to use it in Java. How can I convert it? – Zuleika 29/1, 2018 at 6:51

The syntax is more or less the same. The algorithm uses a polygon computation library, so a similar library should be used to find out if a point is inside a polygon in Java. – Suricate 29/1, 2018 at 12:21

Thank you, I coded that in Java using the overlap area of two rectangles – Zuleika 31/1, 2018 at 15:24

This JavaScript code works for me, but can I get the same code for Python? – Bimestrial 15/5, 2018 at 9:42

I've ported this repository to a web browser version here, on GitHub. – Whiting 6/12, 2022 at 3:21

Here is some simple code to read line by line. It uses the y-axis to find lines and the x-axis to order the words within each line.

items = []
lines = {}

for text in response.text_annotations[1:]:
    top_x_axis = text.bounding_poly.vertices[0].x
    top_y_axis = text.bounding_poly.vertices[0].y
    bottom_y_axis = text.bounding_poly.vertices[3].y

    if top_y_axis not in lines:
        lines[top_y_axis] = [(top_y_axis, bottom_y_axis), []]

    for s_top_y_axis, s_item in lines.items():
        if top_y_axis < s_item[0][1]:
            lines[s_top_y_axis][1].append((top_x_axis, text.description))
            break

for _, item in lines.items():
    if item[1]:
        words = sorted(item[1], key=lambda t: t[0])
        items.append((item[0], ' '.join([word for _, word in words]), words))

print(items)

Wivinia answered 26/1, 2019 at 16:0 Comment(2)

For some reason, Google Vision splits totals. For example, 161.765,31 is split into five words: [161, ., 765, ,, 31]. Is there a configuration I am missing? – Early 1/2, 2020 at 0:52

This does not work if the image is slightly rotated – Keeling 12/10, 2021 at 18:24
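
On the rotation issue raised in the comment above, one possible workaround is to estimate the skew from the word boxes themselves and undo it before grouping by y. The sketch below assumes each word arrives as `(text, vertices)` with vertices in the usual top-left, top-right, bottom-right, bottom-left order; `rotated_box`, `estimate_skew`, `group_rotated` and the `tolerance` parameter (in pixels) are hypothetical names, and the sample data is synthetic.

```python
import math
from statistics import median


def rotated_box(x, y, w, h, angle):
    """Synthetic test data: an axis-aligned box rotated by `angle` radians."""
    def rot(px, py):
        return (px * math.cos(angle) - py * math.sin(angle),
                px * math.sin(angle) + py * math.cos(angle))
    return [rot(x, y), rot(x + w, y), rot(x + w, y + h), rot(x, y + h)]


def estimate_skew(words):
    """Median angle (radians) of each word's top edge (vertex 0 -> vertex 1)."""
    angles = []
    for _, v in words:
        (x0, y0), (x1, y1) = v[0], v[1]
        angles.append(math.atan2(y1 - y0, x1 - x0))
    return median(angles)


def group_rotated(words, tolerance):
    """Rotate word centres back by the estimated skew, then group by y."""
    angle = estimate_skew(words)
    c, s = math.cos(-angle), math.sin(-angle)
    placed = []
    for text, v in words:
        cx = sum(p[0] for p in v) / 4
        cy = sum(p[1] for p in v) / 4
        placed.append((cx * s + cy * c, cx * c - cy * s, text))  # (y, x, text)
    placed.sort()
    lines, current, last_y = [], [], None
    for y, x, text in placed:
        # Start a new line when the vertical gap exceeds the tolerance
        if last_y is not None and y - last_y > tolerance:
            lines.append(current)
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(current)
    return [" ".join(t for _, t in sorted(line)) for line in lines]


skew = math.radians(10)
words = [
    ("John", rotated_box(400, 0, 60, 20, skew)),
    ("DOB:", rotated_box(0, 40, 60, 20, skew)),
    ("Name:", rotated_box(0, 0, 60, 20, skew)),
    ("1970", rotated_box(400, 40, 60, 20, skew)),
]
lines = group_rotated(words, tolerance=10)
print(lines)  # ['Name: John', 'DOB: 1970']
```

With a 10 degree skew, the raw y of "John" falls below that of "DOB:", so sorting on raw y would interleave the two rows. This sketch only targets small rotations; pages turned by 90 degrees or more would need separate handling.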

You can also extract the text line by line based on the bounds: use boundingPoly and concatenate the words that fall on the same line.

"boundingPoly": {
        "vertices": [
          {
            "x": 87,
            "y": 148
          },
          {
            "x": 411,
            "y": 148
          },
          {
            "x": 411,
            "y": 206
          },
          {
            "x": 87,
            "y": 206
          }
        ]

For example, these two words are on the same "line":

"description": "you",
      "boundingPoly": {
        "vertices": [
          {
            "x": 362,
            "y": 1406
          },
          {
            "x": 433,
            "y": 1406
          },
          {
            "x": 433,
            "y": 1448
          },
          {
            "x": 362,
            "y": 1448
          }
        ]
      }
    },
    {
      "description": "start",
      "boundingPoly": {
        "vertices": [
          {
            "x": 446,
            "y": 1406
          },
          {
            "x": 540,
            "y": 1406
          },
          {
            "x": 540,
            "y": 1448
          },
          {
            "x": 446,
            "y": 1448
          }
        ]
      }
    }
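
A minimal Python sketch of that idea, working directly on the JSON shape shown above: a word joins an existing line when its vertical midpoint falls within that line's y range. The helper names are hypothetical, and the "Hello" entry is made up to provide a second line. The `.get(..., 0)` calls are there because, as far as I know, the JSON response can omit coordinates that are zero.

```python
def y_range(word):
    ys = [v.get("y", 0) for v in word["boundingPoly"]["vertices"]]
    return min(ys), max(ys)


def left_x(word):
    return min(v.get("x", 0) for v in word["boundingPoly"]["vertices"])


def group_into_lines(words):
    lines = []  # each entry: [top, bottom, [words on that line]]
    for word in sorted(words, key=lambda w: y_range(w)[0]):
        top, bottom = y_range(word)
        mid = (top + bottom) / 2
        for line in lines:
            if line[0] <= mid <= line[1]:
                line[2].append(word)
                break
        else:
            # No existing line contains this word's midpoint: start a new one
            lines.append([top, bottom, [word]])
    return [
        " ".join(w["description"] for w in sorted(line[2], key=left_x))
        for line in lines
    ]


def box(x1, y1, x2, y2):
    """Build a boundingPoly dict from two corners (test helper)."""
    return {"vertices": [{"x": x1, "y": y1}, {"x": x2, "y": y1},
                         {"x": x2, "y": y2}, {"x": x1, "y": y2}]}


words = [
    {"description": "start", "boundingPoly": box(446, 1406, 540, 1448)},
    {"description": "you", "boundingPoly": box(362, 1406, 433, 1448)},
    {"description": "Hello", "boundingPoly": box(87, 148, 411, 206)},  # made up
]
result = group_into_lines(words)
print(result)  # ['Hello', 'you start']
```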

Ramos answered 15/6, 2017 at 10:9 Comment(1)

Thanks, that is one possibility. – Tran 27/9, 2017 at 7:41

I take the min and max y across all boxes and iterate over y to collect every potential line. Here is the full code:

import io
import sys
from os import listdir

from google.cloud import vision


def read_image(image_file):
    client = vision.ImageAnnotatorClient()

    with io.open(image_file, "rb") as f:
        content = f.read()

    image = vision.Image(content=content)

    return client.document_text_detection(
        image=image,
        image_context={"language_hints": ["bg"]}
    )


def extract_paragraphs(image_file):
    response = read_image(image_file)

    min_y = sys.maxsize
    max_y = -1
    for t in response.text_annotations:
        poly_range = get_poly_y_range(t.bounding_poly)
        t_min = min(poly_range)
        t_max = max(poly_range)
        if t_min < min_y:
            min_y = t_min
        if t_max > max_y:
            max_y = t_max
    max_size = max_y - min_y

    text_boxes = []
    for t in response.text_annotations:
        poly_range = get_poly_y_range(t.bounding_poly)
        t_x = get_poly_x(t.bounding_poly)
        t_min = min(poly_range)
        t_max = max(poly_range)
        poly_size = t_max - t_min
        text_boxes.append({
            'min_y': t_min,
            'max_y': t_max,
            'x': t_x,
            'size': poly_size,
            'description': t.description
        })

    paragraphs = []
    for i in range(min_y, max_y):
        para_line = []
        for text_box in text_boxes:
            t_min = text_box['min_y']
            t_max = text_box['max_y']
            x = text_box['x']
            size = text_box['size']

            # size < max_size excludes the biggest rect
            if size < max_size * 0.9 and t_min <= i <= t_max:
                para_line.append(
                    {
                        'text': text_box['description'],
                        'x': x
                    }
                )
        # sort by x so the words don't get randomly shuffled
        para_line = sorted(para_line, key=lambda x: x['x'])
        line = " ".join(map(lambda x: x['text'], para_line))
        paragraphs.append(line)
        # if line not in paragraphs:
        #     paragraphs.append(line)

    return "\n".join(paragraphs)


def get_poly_y_range(poly):
    y_list = []
    for v in poly.vertices:
        if v.y not in y_list:
            y_list.append(v.y)
    return y_list


def get_poly_x(poly):
    return poly.vertices[0].x




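# NOTE: rootPics, outputRoot and write() are not defined in this snippet;
# they are presumably defined elsewhere in the script.
def extract_paragraphs_from_image(picName):
    print(picName)
    pic_path = rootPics + "/" + picName

    text = extract_paragraphs(pic_path)

    text_path = outputRoot + "/" + picName + ".txt"
    write(text_path, text)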

This code is WIP.

In the end, the same line appears multiple times (in the paragraphs variable), so post-processing is needed to determine the exact values. Let me know if I need to clarify anything.
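
As one possible first post-processing step (not the author's own), consecutive identical rows and empty rows can be collapsed. The sketch assumes the newline-joined string returned by `extract_paragraphs`; `collapse_rows` is a hypothetical name.

```python
from itertools import groupby


def collapse_rows(text):
    """Drop empty rows and collapse runs of identical consecutive rows."""
    rows = [row.strip() for row in text.split("\n")]
    return [row for row, _ in groupby(rows) if row]


sample = "Name: John\nName: John\n\n\nDOB: 1970\nDOB: 1970\nDOB: 1970"
collapsed = collapse_rows(sample)
print(collapsed)  # ['Name: John', 'DOB: 1970']
```

This only removes exact repeats. Partial rows, where the scan line crosses only some of the words, still need a separate filter.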

Morphophonemics answered 22/11, 2021 at 16:23 Comment(2)

Hi @Borislav Stoilov! Do you remember what value you gave to the constant "PARAGRAPH_HEIGHT"? It would help me greatly if you could answer. Thanks! – Subdue 16/3, 2022 at 12:0

@Subdue no, I changed my approach entirely. Now I get the min y and the max y across all rectangles. Then I iterate from minY to maxY, and every rectangle that contains the current y is part of a potential line. I will just post my code in the answer. – Morphophonemics 16/3, 2022 at 13:16

Inspired by Borislav's answer, I just wrote something in Python that also works for handwriting. It's messy and I am new to Python, but I think you can get an idea of how to do this.

A class to hold some extended data for each word, for example, the average y position of a word, which I used to calculate the differences between words:

import re
from operator import attrgetter

import numpy as np

class ExtendedAnnotation:
    def __init__(self, annotation):
        self.vertex = annotation.bounding_poly.vertices
        self.text = annotation.description
        self.avg_y = (self.vertex[0].y + self.vertex[1].y + self.vertex[2].y + self.vertex[3].y) / 4
        self.height = ((self.vertex[3].y - self.vertex[1].y) + (self.vertex[2].y - self.vertex[0].y)) / 2
        self.start_x = (self.vertex[0].x + self.vertex[3].x) / 2

    def __repr__(self):
        return '{' + self.text + ', ' + str(self.avg_y) + ', ' + str(self.height) + ', ' + str(self.start_x) + '}'

Create objects with that data:

def get_extended_annotations(response):
    extended_annotations = []
    for annotation in response.text_annotations:
        extended_annotations.append(ExtendedAnnotation(annotation))

    # delete the first item, as it holds the whole text
    del extended_annotations[0]
    return extended_annotations

Calculate the threshold.
First, all words are sorted by their y position, defined as the average of all 4 corners of a word. The x position is not relevant at this point. Then the difference between every word and the word that follows it is calculated. For a perfectly straight line of words, you would expect the difference in y position between every two words to be 0. Even for handwriting, it should be around 1 to 10.
However, whenever there is a line break, the difference between the last word of the former row and the first word of the new row is much greater than that, for example 50 or 60.
So, to decide whether there should be a line break between two words, the standard deviation of the differences is used as the threshold.

def get_threshold_for_y_difference(annotations):
    annotations.sort(key=attrgetter('avg_y'))
    differences = []
    for i in range(0, len(annotations)):
        if i == 0:
            continue
        differences.append(abs(annotations[i].avg_y - annotations[i - 1].avg_y))
    return np.std(differences)
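
To make the threshold concrete, here is a small worked example with made-up y positions for two rows of three words each. It uses `statistics.pstdev` from the standard library, which computes the same population standard deviation as `np.std` with its default arguments.

```python
from statistics import pstdev

# Average y of six words: three on a row near y=100, three near y=160
avg_ys = sorted([100, 102, 101, 160, 162, 161])

# Differences between neighbours after sorting by y
differences = [abs(b - a) for a, b in zip(avg_ys, avg_ys[1:])]
threshold = pstdev(differences)

# A difference above the threshold marks a line break
breaks = [d > threshold for d in differences]

print(differences)          # [1, 1, 58, 1, 1]
print(round(threshold, 1))  # 22.8
print(breaks)               # [False, False, True, False, False]
```

Note that the approach leans on there being at least one real line break. If all words sit on a single row, the differences are all small, the standard deviation is small too, and ordinary jitter can exceed it.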

Having calculated the threshold, the list of all words gets grouped into rows accordingly.

def group_annotations(annotations, threshold):
    annotations.sort(key=attrgetter('avg_y'))
    line_index = 0
    text = [[]]
    for i in range(0, len(annotations)):
        if i == 0:
            text[line_index].append(annotations[i])
            continue
        y_difference = abs(annotations[i].avg_y - annotations[i - 1].avg_y)
        if y_difference > threshold:
            line_index = line_index + 1
            text.append([])
        text[line_index].append(annotations[i])
    return text

Finally, each row is sorted by x position to put the words in the correct order from left to right.
Then a little regex is used to remove whitespace in front of punctuation.

def sort_and_combine_grouped_annotations(annotation_lists):
    grouped_list = []
    for annotation_group in annotation_lists:
        annotation_group.sort(key=attrgetter('start_x'))
        texts = (o.text for o in annotation_group)
        texts = ' '.join(texts)
        texts = re.sub(r'\s([-;:?.!](?:\s|$))', r'\1', texts)
        grouped_list.append(texts)
    return grouped_list
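
To see what the regex in the function above does on its own, here it is applied to a few sample strings (the strings and the `tidy` helper name are made up for illustration):

```python
import re

# Same pattern as above: whitespace, then one of - ; : ? . !
# followed by whitespace or end of string
PUNCT_SPACE = re.compile(r'\s([-;:?.!](?:\s|$))')


def tidy(line):
    """Remove the space in front of matching punctuation."""
    return PUNCT_SPACE.sub(r'\1', line)


print(tidy("Name : John Doe"))     # Name: John Doe
print(tidy("Is it done ? Yes !"))  # Is it done? Yes!
print(tidy("161 , 31"))            # 161 , 31
```

The last example is unchanged because the comma is not in the character class, so a space before a comma is left in place.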

Ligialignaloes answered 25/11, 2021 at 4:1 Comment(0)

Based on Borislav Stoilov's latest answer, I wrote the code in C# for anybody who might need it in the future. Find the code below:

public static List<TextParagraph> ExtractParagraphs(IReadOnlyList<EntityAnnotation> textAnnotations)
    {
        var min_y = int.MaxValue;
        var max_y = -1;
        foreach (var item in textAnnotations)
        {
            var poly_range = Get_poly_y_range(item.BoundingPoly);
            var t_min = poly_range.Min();
            var t_max = poly_range.Max();
            if (t_min < min_y) min_y = t_min;
            if (t_max > max_y) max_y = t_max;
        }
        var max_size = max_y - min_y;
        var text_boxes = new List<TextBox>();

        foreach (var item in textAnnotations)
        {
            var poly_range = Get_poly_y_range(item.BoundingPoly);
            var t_x = Get_poly_x(item.BoundingPoly);
            var t_min = poly_range.Min();
            var t_max = poly_range.Max();
            var poly_size = t_max - t_min;
            text_boxes.Add(new TextBox
            {
                Min_y = t_min,
                Max_y = t_max,
                X = t_x,
                Size = poly_size,
                Description = item.Description
            });
        }

        var paragraphs = new List<TextParagraph>();
        for (int i = min_y; i < max_y; i++)
        {
            var para_line = new List<TextLine>();
            foreach (var text_box in text_boxes)
            {
                int t_min = text_box.Min_y;
                int t_max = text_box.Max_y;
                int x = text_box.X;
                int size = text_box.Size;

                // size < max_size excludes the biggest rect
                if (size < (max_size * 0.9) && t_min <= i && i <= t_max)
                    para_line.Add(
                        new TextLine
                        {
                            Text = text_box.Description,
                            X = x
                        }
                    );
            }

            // sort by x so the words don't get randomly shuffled
            para_line = para_line.OrderBy(x => x.X).ToList();
            var line = string.Join(" ", para_line.Select(x => x.Text));
            var paragraph = new TextParagraph
            {
                Order = i,
                Text = line,
                WordCount = para_line.Count,
                TextBoxes = para_line
            };
            paragraphs.Add(paragraph);
        }
        return paragraphs;
        //return string.Join("\n", paragraphs);

    }

    private static List<int> Get_poly_y_range(BoundingPoly poly)
    {
        var y_list = new List<int>();
        foreach (var v in poly.Vertices)
        {
            if (!y_list.Contains(v.Y))
            {
                y_list.Add(v.Y);
            }
        }
        return y_list;
    }

    private static int Get_poly_x(BoundingPoly poly)
    {
        return poly.Vertices[0].X;
    }

Calling the ExtractParagraphs() method returns a list of paragraphs whose text contains duplicates. I also wrote some custom code to handle that problem. If you need any help processing the duplicates, let me know and I can provide the rest of the code.
Example:
Text in picture: "I want to make this thing work 24/7!"
Code will return:
"I"
"I want"
"I want to "
"I want to make"
"I want to make this"
"I want to make this thing"
"I want to make this thing work"
"I want to make this thing work 24/7!"
"to make this thing work 24/7!"
"this thing work 24/7!"
"thing work 24/7!"
"work 24/7!"
"24/7!"

I also have an implementation for converting PDFs to PNGs, because the Google Cloud Vision API won't accept PDFs that are not stored in a Cloud Storage bucket. If needed, I can provide it. Happy coding!
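
For the duplicated partial lines shown in the example above, one possible filter is to drop every line whose words form a contiguous run inside a longer line. The sketch below is in Python rather than C# and is not the author's custom code; `is_fragment` and `drop_fragments` are hypothetical names.

```python
def is_fragment(short, long):
    """True if the words of `short` appear as a contiguous run inside `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def drop_fragments(lines):
    # Remove blanks and exact duplicates while keeping the original order
    unique = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    return [a for a in unique if not any(is_fragment(a, b) for b in unique)]


rows = [
    "I", "I want", "I want to ", "I want to make", "I want to make this",
    "I want to make this thing", "I want to make this thing work",
    "I want to make this thing work 24/7!", "to make this thing work 24/7!",
    "this thing work 24/7!", "thing work 24/7!", "work 24/7!", "24/7!",
]
kept = drop_fragments(rows)
print(kept)  # ['I want to make this thing work 24/7!']
```

A caveat: a genuinely separate short line would also be dropped if its words happen to occur in a longer line elsewhere on the page, so restricting the comparison to neighbouring rows (for example by the Order value) is probably safer.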

Subdue answered 17/3, 2022 at 18:32 Comment(0)
