Inspired by Borislav's answer, I just wrote something for Python that also works for handwriting. It's messy and I am new to Python, but I think you can get an idea of how to do this.
A class to hold some extended data for each word, for example, the average y position of a word, which I used to calculate the differences between words:
import re
from operator import attrgetter

import numpy as np


class ExtendedAnnotation:
    def __init__(self, annotation):
        # Assumes the four vertices are ordered top-left, top-right, bottom-right, bottom-left.
        self.vertex = annotation.bounding_poly.vertices
        self.text = annotation.description
        self.avg_y = (self.vertex[0].y + self.vertex[1].y + self.vertex[2].y + self.vertex[3].y) / 4
        self.height = ((self.vertex[3].y - self.vertex[1].y) + (self.vertex[2].y - self.vertex[0].y)) / 2
        self.start_x = (self.vertex[0].x + self.vertex[3].x) / 2

    def __repr__(self):
        return '{' + self.text + ', ' + str(self.avg_y) + ', ' + str(self.height) + ', ' + str(self.start_x) + '}'
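To make the fields a bit more concrete, here is a small sketch that runs the same arithmetic on a stand-in word box built with `types.SimpleNamespace` instead of a real API response. The `make_box` and `box_metrics` helpers and the coordinates are made up for illustration, and it assumes the vertices come in the order top-left, top-right, bottom-right, bottom-left.

```python
from types import SimpleNamespace


def make_box(left, top, right, bottom):
    """Hypothetical helper: four vertices ordered TL, TR, BR, BL."""
    return [
        SimpleNamespace(x=left, y=top),
        SimpleNamespace(x=right, y=top),
        SimpleNamespace(x=right, y=bottom),
        SimpleNamespace(x=left, y=bottom),
    ]


def box_metrics(vertices):
    """Same arithmetic as ExtendedAnnotation, applied to a bare vertex list."""
    avg_y = sum(v.y for v in vertices) / 4
    height = ((vertices[3].y - vertices[1].y) + (vertices[2].y - vertices[0].y)) / 2
    start_x = (vertices[0].x + vertices[3].x) / 2
    return avg_y, height, start_x


metrics = box_metrics(make_box(left=10, top=100, right=80, bottom=120))
print(metrics)  # (110.0, 20.0, 10.0)
```

Written as (bottom-left minus top-right) plus (bottom-right minus top-left), the height looks like it mixes corners, but the sum equals the average bottom y minus the average top y, so it still works for slightly tilted boxes.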
Create objects with that data:
def get_extended_annotations(response):
    extended_annotations = []
    for annotation in response.text_annotations:
        extended_annotations.append(ExtendedAnnotation(annotation))
    # Delete the first item, as it holds the whole detected text rather than a single word.
    del extended_annotations[0]
    return extended_annotations
Calculate the threshold.
First, all words are sorted by their y position, defined as the average y coordinate of a word's four corners. The x position is not relevant at this point.
Then the difference in y position between each word and the next one is calculated. For a perfectly straight line of words, you would expect these differences to be 0. Even for handwriting, they should be around 1 to 10.
However, whenever there is a line break, the difference between the last word of the previous row and the first word of the new row is much greater than that, for example 50 or 60.
So to decide whether there should be a line break between two words, the standard deviation of the differences is used as the threshold.
def get_threshold_for_y_difference(annotations):
    # Note: this sorts the caller's list in place.
    annotations.sort(key=attrgetter('avg_y'))
    differences = []
    # Start at 1 so every word is compared with the one before it.
    for i in range(1, len(annotations)):
        differences.append(abs(annotations[i].avg_y - annotations[i - 1].avg_y))
    return np.std(differences)
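As a quick sanity check of the idea, here is a sketch with made-up numbers: six average y positions that a human would read as three lines. It uses `np.diff` on plain floats instead of annotation objects, so the variable names here are only for illustration.

```python
import numpy as np

# Made-up average y positions of six words, already sorted top to bottom.
avg_ys = [100, 102, 105, 160, 163, 220]

# Differences between neighbouring words: [2, 3, 55, 3, 57]
differences = np.abs(np.diff(avg_ys))

# Population standard deviation (ddof=0), the np.std default.
threshold = np.std(differences)

# Index of each word that starts a new line (its gap to the previous word is too big).
line_starts = [i + 1 for i, d in enumerate(differences) if d > threshold]

print(round(float(threshold), 2))  # 26.14
print(line_starts)                 # [3, 5]
```

One thing to keep in mind: unless all differences are identical, the largest one is always above the standard deviation, so this heuristic will split even a single line of text at least once. It seems to fit best when the image really does contain several lines.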
Having calculated the threshold, the list of all words gets grouped into rows accordingly.
def group_annotations(annotations, threshold):
    annotations.sort(key=attrgetter('avg_y'))
    line_index = 0
    text = [[]]
    for i in range(0, len(annotations)):
        if i == 0:
            text[line_index].append(annotations[i])
            continue
        y_difference = abs(annotations[i].avg_y - annotations[i - 1].avg_y)
        if y_difference > threshold:
            line_index = line_index + 1
            text.append([])
        text[line_index].append(annotations[i])
    return text
Finally, each row is sorted by x position to get its words into the correct order from left to right.
Then a little regex is used to remove the whitespace in front of punctuation.
def sort_and_combine_grouped_annotations(annotation_lists):
    grouped_list = []
    for annotation_group in annotation_lists:
        annotation_group.sort(key=attrgetter('start_x'))
        texts = (o.text for o in annotation_group)
        texts = ' '.join(texts)
        texts = re.sub(r'\s([-;:?.!](?:\s|$))', r'\1', texts)
        grouped_list.append(texts)
    return grouped_list
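To show what the regex step does on its own, here is a small sketch that applies the same pattern to a few made-up strings. The `tidy` helper and `PATTERN` name are just for this example.

```python
import re

# Same pattern as in sort_and_combine_grouped_annotations: one whitespace
# character, then one of - ; : ? . ! that is followed by whitespace or the
# end of the string. The whitespace in front is dropped.
PATTERN = r'\s([-;:?.!](?:\s|$))'


def tidy(line):
    """Hypothetical helper: remove the space in front of punctuation."""
    return re.sub(PATTERN, r'\1', line)


print(tidy('Hello world !'))        # Hello world!
print(tidy('Wait ; what ? Yes .'))  # Wait; what? Yes.
print(tidy('Hi , there'))           # Hi , there
```

The last line stays as it is because the comma is not in the character class. If you want commas handled as well, adding `,` to the brackets should do it.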