Document Layout Analysis for text extraction

I need to analyze the layout structure of different document types, such as PDF, DOC, DOCX, ODT, etc.

My task is: given a document, group the text into blocks, finding the correct boundaries of each.

I did some tests using Apache Tika, which is a good extractor. It is a very good tool, but it often messes up the order of the blocks; let me explain what I mean by ORDER.

Apache Tika just extracts the text, so if my document has two columns, Tika extracts the entire text of the first column and then the text of the second column, which is OK... but sometimes the text in the first column is related to the text in the second, like a table whose rows express the relation.

So I must take care of the position of each block, and the problems are:

  1. Define the block boundaries, which is hard... I should understand whether a sentence is starting a new block or not.

  2. Define the orientation; for example, given a table, the "sentence" should be the row, NOT the column.

So basically I have to deal with the layout structure to correctly understand the block boundaries.

Here is a visual example:

[image: an excerpt of a CV where the years 2019–2014 run down a left column and the related entries, e.g. "Oregon Arts Commission Individual Artist Fellowship...", sit in the right column]

A classical extractor returns:

2019
2018
2017
2016
2015
2014
Oregon Arts Commission Individual Artist Fellowship...

Which is wrong (in my case) because the dates are related to the text on the right.

This task is preparatory for other NLP analysis, so it is very important: for example, when I need to recognize the entities (NER) inside the text and then identify their relations, working with the correct context is essential.

How can I extract the text from the document and assemble related pieces of text under the same block (i.e., understanding the layout structure of the document)?

Daughter answered 4/3, 2021 at 11:20 Comment(3)
Have you tried converting the file to HTML (using pandoc or something else) and seeing if it is more usable that way? – Maxim
@Maxim hmm, honestly no, but I wonder how it could understand the context. Because if it creates a <table>, I still need to understand whether I should read per column or per row. So I think NLP is involved in this analysis. – Daughter
I have added an example of using the easyocr package. The results are quite good. – Dyestuff

This is only a partial solution to your issue, but it may simplify the task at hand. This tool receives PDF files and converts them to text files. It works pretty fast and can run on batches of files.

It creates an output text file for each PDF. The advantage of this tool over others is that the output text is aligned in accordance with the original layout.

For example, this is a resume with complex layout:

[image: a two-column résumé for Christopher Morgan, with contact details, languages and hobbies in the left column and summary, skills, experience, education and certifications in the right column]

The output for it is the following text file:

Christopher                         Summary
                                    Senior Web Developer specializing in front end development.
Morgan                              Experienced with all stages of the development cycle for
                                    dynamic web projects. Well-versed in numerous programming
                                    languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
                                    Strong background in project management and customer
                                    relations.


                                    Skill Highlights
                                        •   Project management          •   Creative design
                                        •   Strong decision maker       •   Innovative
                                        •   Complex problem             •   Service-focused
                                            solver


                                    Experience
Contact
                                    Web Developer - 09/2015 to 05/2019
Address:                            Luna Web Design, New York
177 Great Portland Street, London      • Cooperate with designers to create clean interfaces and
W5W 6PQ                                   simple, intuitive interactions and experiences.
                                       • Develop project concepts and maintain optimal
Phone:                                    workflow.
+44 (0)20 7666 8555
                                       • Work with senior developer to manage large, complex
                                          design projects for corporate clients.
Email:
                                       • Complete detailed programming and development tasks
[email protected]
                                          for front end public and internal websites as well as
                                          challenging back-end server code.
LinkedIn:
                                       • Carry out quality assurance tests to discover errors and
linkedin.com/christopher.morgan
                                          optimize usability.

Languages                           Education
Spanish – C2
                                    Bachelor of Science: Computer Information Systems - 2014
Chinese – A1
                                    Columbia University, NY
German – A2


Hobbies                             Certifications
                                    PHP Framework (certificate): Zend, Codeigniter, Symfony.
   •   Writing
                                    Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
   •   Sketching
                                    SQL, MySQL.
   •   Photography
   •   Design
-----------------------Page 1 End-----------------------

Now your task is reduced to finding the blocks within a text file, using the spaces between words as alignment hints. As a start, I include a script that finds the margin between two columns of text and yields lhs and rhs, the text streams of the left and right columns respectively.

import numpy as np
import matplotlib.pyplot as plt
import re

# load the converted text file produced by the PDF-to-text tool ("resume.txt" is just a placeholder name)
txt = open("resume.txt", encoding="utf-8").read()

txt_lines = txt.split('\n')
max_line_index = max([len(line) for line in txt_lines])
padded_txt_lines = [line + " " * (max_line_index - len(line)) for line in txt_lines] # pad short lines with spaces
space_idx_counters = np.zeros(max_line_index)

for idx, line in enumerate(padded_txt_lines):
    if line.find("-----------------------Page") >= 0: # reached end of page
        break
    space_idxs = [pos for pos, char in enumerate(line) if char == " "]
    space_idx_counters[space_idxs] += 1

padded_txt_lines = padded_txt_lines[:idx] #remove end page line

# plot histogram of spaces in each character column
plt.bar(list(range(len(space_idx_counters))), space_idx_counters)
plt.title("Number of spaces in each column over all lines")
plt.show()

# find the separator column idx
separator_idx = np.argmax(space_idx_counters)
print(f"separator index: {separator_idx}")
left_lines = []
right_lines = []

# separate two columns of text
for line in padded_txt_lines:
    left_lines.append(line[:separator_idx])
    right_lines.append(line[separator_idx:])

# join each column into one stream of text, removing redundant spaces
lhs = ' '.join(left_lines)
lhs = re.sub(r"\s{4,}", " ", lhs)
rhs = ' '.join(right_lines)
rhs = re.sub(r"\s{4,}", " ", rhs)

print("************ Left Hand Side ************")
print(lhs)
print("************ Right Hand Side ************")
print(rhs)

Plot output:

[plot: bar chart of the number of spaces in each character column over all lines; the tallest bar marks the column separator, around index 33]

Text output:

separator index: 33
************ Left Hand Side ************
Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: [email protected] LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies •   Writing •   Sketching •   Photography •   Design 
************ Right Hand Side ************
   Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights •   Project management •   Creative design •   Strong decision maker •   Innovative •   Complex problem •   Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL. 

The next step would be to generalize this script to work on multi-page documents, remove redundant characters, etc.
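For instance, here is a rough sketch of that generalization, splitting the text on the page-end marker and reusing the same most-frequent-space-column idea per page ("resume.txt" is only a placeholder for wherever the converted text was saved):

import re
import numpy as np

def split_pages(txt):
    """Split the converted text on the '---Page N End---' marker lines."""
    pages = re.split(r"-+Page\s+\d+\s+End-+", txt)
    return [page for page in pages if page.strip()]

def split_two_columns(page_lines):
    """Return (lhs, rhs) for one page, using the character column that is
    most often a space as the separator (same idea as the script above)."""
    width = max(len(line) for line in page_lines)
    padded = [line.ljust(width) for line in page_lines]
    space_counts = np.zeros(width)
    for line in padded:
        space_counts[[pos for pos, char in enumerate(line) if char == " "]] += 1
    sep = int(np.argmax(space_counts))
    lhs = re.sub(r"\s{4,}", " ", " ".join(line[:sep] for line in padded)).strip()
    rhs = re.sub(r"\s{4,}", " ", " ".join(line[sep:] for line in padded)).strip()
    return lhs, rhs

txt = open("resume.txt", encoding="utf-8").read()  # placeholder file name
for page_no, page in enumerate(split_pages(txt), start=1):
    lhs, rhs = split_two_columns(page.split("\n"))
    print(f"*** Page {page_no} - Left Hand Side ***\n{lhs}")
    print(f"*** Page {page_no} - Right Hand Side ***\n{rhs}")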

Good luck!

Hummel answered 9/3, 2021 at 16:9 Comment(4)
Hi @Shir! That's really interesting, thanks. I am studying your code right now, and I wonder whether, even with the correct position of each word in the document, I could run into problems when I need to do the sentence segmentation. Maybe I should use your code recursively to define more columns, or, for example, when there is a label on the LHS and the value on the RHS, the problem could be that the label is split across multiple lines. What do you think? Do you know what kind of tool this website is using? – Daughter
Hi @Dail, I think you should define which types of layouts you wish to deal with and prioritize them. Also, properly define the type of output you expect, e.g., do lists of skills qualify as a single sentence or not? The code I included deals with a simple one-page, two-column layout. If you disregard the last block of code that joins the columns, you have an initial separation into sentences. Using it recursively could definitely assist in managing more complex layouts. Also, a component that works on different pages separately could easily be incorporated. – Hummel
As for the algorithm the tool uses, I am not quite sure; it's just a tool I found online. However, there are other open-source tools you can use and dive into their code to see the OCR methods, e.g. pdfminer and its GitHub repo. Let me know if you have more questions. – Hummel
You saved my life! – Glossa

For your example, Tesseract was able to produce the desired output after configuring the page segmentation mode via the --psm flag. See the docs:

--psm 6    Assume a single uniform block of text.

Of course, Tesseract works with images. You could try converting PDFs to images with pdf2image. For the .docx, .doc and .odt formats, one option would be to use pywin32 to handle the conversion to PDF.
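As a rough sketch of that pipeline (assuming pytesseract and pdf2image are installed on top of the Tesseract and Poppler binaries; "input.pdf" is just a placeholder file name):

# pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("input.pdf", dpi=300)  # one PIL image per page

for page_no, image in enumerate(pages, start=1):
    # --psm 6: assume a single uniform block of text
    text = pytesseract.image_to_string(image, config="--psm 6")
    print(f"----- page {page_no} -----")
    print(text)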

Publish answered 9/3, 2021 at 3:44 Comment(1)
Yes, but I must deal with other problems too; here I am asking what approach I should use. Maybe training a new model? Or training Tesseract somehow? – Daughter

You can use easyocr. It uses deep learning models to extract characters and returns both the words and the locations of the words on the page. The steps would be to transform your documents into images and then run the analysis.

#pip install -U easyocr
import easyocr

language = "en"

image_path = "https://i.sstatic.net/i6vHT.png"

reader = easyocr.Reader([language])
response = reader.readtext(image_path, detail=True)

print(response)

Here is an example where we simply ignore the bounding box details:

[image: the printed output, showing the extracted text in reading order]

The text is gathered correctly as it appears.
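Since each entry returned with detail=True is (bounding_box, text, confidence), where the bounding box is four [x, y] corner points, you could also use the boxes instead of ignoring them, e.g. to regroup the detections into rows. A rough sketch, continuing from the snippet above (the 10-pixel row tolerance is an arbitrary assumption):

# response comes from reader.readtext(image_path, detail=True) above
rows = []  # each row: {"y": vertical center, "items": [(x_left, text), ...]}
for bbox, text, confidence in response:
    y_center = sum(point[1] for point in bbox) / len(bbox)
    x_left = min(point[0] for point in bbox)
    for row in rows:
        if abs(row["y"] - y_center) < 10:  # arbitrary tolerance in pixels
            row["items"].append((x_left, text))
            break
    else:
        rows.append({"y": y_center, "items": [(x_left, text)]})

# print rows top to bottom, and within each row left to right
for row in sorted(rows, key=lambda r: r["y"]):
    print("  ".join(t for _, t in sorted(row["items"])))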

Dyestuff answered 12/3, 2021 at 15:24 Comment(2)
Hello! Thanks for your message. The problem with an OCR is that I need to implement a difficult sentence segmentation, because if I do .split('\n') I will get mixed blocks of text. Do you have any idea about that? – Daughter
You have a bounding box for each detection, which means you know where the text is. The above was just printed to show you that everything is in order. – Dyestuff

Check the Konfuzio documentation for text analysis and extraction. You can define your own model and access the data.

You can get the layout structure of the document using Konfuzio, even for documents with a two-column layout. It segments the document into five classes: text, title, list, table and figure.

# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init

from konfuzio_sdk.api import get_results_from_segmentation

result = get_results_from_segmentation(doc_id=1111, project_id=111)  # use your own document and project IDs

The result will contain the bounding boxes of the different elements in the document and their respective classifications. You can use the bounding box information to find which elements are in the same row, for example.
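As a rough illustration of that idea (the x0/y0/x1/y1 field names below are an assumption about the bounding box format, so check the actual result structure in the SDK documentation):

def same_row(box_a, box_b, min_overlap=0.5):
    """Rough check: two elements share a row if their vertical extents overlap
    by at least `min_overlap` of the smaller element's height.
    Assumes each box is a dict with x0, y0, x1, y1 keys (hypothetical format)."""
    overlap = min(box_a["y1"], box_b["y1"]) - max(box_a["y0"], box_b["y0"])
    smaller_height = min(box_a["y1"] - box_a["y0"], box_b["y1"] - box_b["y0"])
    return smaller_height > 0 and overlap / smaller_height >= min_overlap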

https://github.com/konfuzio-ai/document-ai-python-sdk/issues/7

Calico answered 5/5, 2021 at 5:31 Comment(0)

You may want to check out the document layout parser on GitHub.

Selfcontent answered 23/8, 2021 at 6:50 Comment(0)
