This is only a partial solution to your issue, but it may simplify the task at hand.
This tool takes PDF files and converts them to text files. It is fairly fast and can run on batches of files.
It creates one output text file per PDF. Its advantage over other tools is that the output text is aligned according to the original layout.
For example, this is a resume with a complex layout:
The output for it is the following text file:
Christopher Summary
Senior Web Developer specializing in front end development.
Morgan Experienced with all stages of the development cycle for
dynamic web projects. Well-versed in numerous programming
languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
Strong background in project management and customer
relations.
Skill Highlights
• Project management • Creative design
• Strong decision maker • Innovative
• Complex problem • Service-focused
solver
Experience
Contact
Web Developer - 09/2015 to 05/2019
Address: Luna Web Design, New York
177 Great Portland Street, London • Cooperate with designers to create clean interfaces and
W5W 6PQ simple, intuitive interactions and experiences.
• Develop project concepts and maintain optimal
Phone: workflow.
+44 (0)20 7666 8555
• Work with senior developer to manage large, complex
design projects for corporate clients.
Email:
• Complete detailed programming and development tasks
[email protected]
for front end public and internal websites as well as
challenging back-end server code.
LinkedIn:
• Carry out quality assurance tests to discover errors and
linkedin.com/christopher.morgan
optimize usability.
Languages Education
Spanish – C2
Bachelor of Science: Computer Information Systems - 2014
Chinese – A1
Columbia University, NY
German – A2
Hobbies Certifications
PHP Framework (certificate): Zend, Codeigniter, Symfony.
• Writing
Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
• Sketching
SQL, MySQL.
• Photography
• Design
-----------------------Page 1 End-----------------------
Now your task is reduced to finding the blocks within a text file, using the spaces between words as alignment hints.
As a start, I include a script that finds the margin between two columns of text and yields rhs and lhs, the text streams of the right and left columns respectively.
import re

import matplotlib.pyplot as plt
import numpy as np

# txt is expected to hold the contents of the output text file as one string
txt_lines = txt.split('\n')
max_line_index = max(len(line) for line in txt_lines)
padded_txt_lines = [line.ljust(max_line_index) for line in txt_lines]  # pad short lines with spaces

# count the spaces in each character column, up to the end-of-page line
space_idx_counters = np.zeros(max_line_index)
page_end = len(padded_txt_lines)  # keep every line if no end-of-page line is found
for idx, line in enumerate(padded_txt_lines):
    if "-----------------------Page" in line:  # reached end of page
        page_end = idx
        break
    space_idxs = [pos for pos, char in enumerate(line) if char == " "]
    space_idx_counters[space_idxs] += 1
padded_txt_lines = padded_txt_lines[:page_end]  # remove end page line

# plot histogram of spaces in each character column
plt.bar(list(range(len(space_idx_counters))), space_idx_counters)
plt.title("Number of spaces in each column over all lines")
plt.show()

# find the separator column idx (the first column with the most spaces)
separator_idx = np.argmax(space_idx_counters)
print(f"separator index: {separator_idx}")

left_lines = []
right_lines = []
# separate two columns of text
for line in padded_txt_lines:
    left_lines.append(line[:separator_idx])
    right_lines.append(line[separator_idx:])

# join each block into one stream of text, remove redundant spaces
lhs = ' '.join(left_lines)
lhs = re.sub(r"\s{4,}", " ", lhs)
rhs = ' '.join(right_lines)
rhs = re.sub(r"\s{4,}", " ", rhs)

print("************ Left Hand Side ************")
print(lhs)
print("************ Right Hand Side ************")
print(rhs)
Plot output:
Text output:
separator index: 33
************ Left Hand Side ************
Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: [email protected] LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies • Writing • Sketching • Photography • Design
************ Right Hand Side ************
Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights • Project management • Creative design • Strong decision maker • Innovative • Complex problem • Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL.
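The script picks the separator with np.argmax, which returns the first column that has the highest space count. That works here because one column is blank on every line. As a rough sketch of a more defensive variant (my own assumption, not something the tool or the script above provides), you could look for the widest run of fully blank columns and ignore the page margins. The helper name find_gutter and the min_width parameter are hypothetical.

```python
def find_gutter(lines, min_width=2):
    """Return (start, end) of the widest run of all-blank interior columns.

    Hypothetical helper, not part of the original script. The end index is
    exclusive. Returns None when no run of at least min_width columns exists.
    """
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    # A column is blank when every line has a space at that position.
    blank = [all(line[col] == " " for line in padded) for col in range(width)]

    best = None
    col = 0
    while col < width:
        if not blank[col]:
            col += 1
            continue
        start = col
        while col < width and blank[col]:
            col += 1
        # Runs touching the left or right edge are page margins, not gutters.
        if start == 0 or col == width:
            continue
        run = col - start
        if run >= min_width and (best is None or run > best[1] - best[0]):
            best = (start, col)
    return best
```

Any index inside the returned range can serve as the separator; a result of None suggests the page has a single column or that the columns are not cleanly aligned.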
The next step would be to generalize this script to work on multi-page documents, remove redundant symbols, etc.
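As a starting point for the multi-page case, here is a minimal sketch that splits the text on the end-of-page lines and applies the same column split to each page. It assumes every page ends with a line shaped like the "Page 1 End" marker in the sample output, and that each page has its own two-column layout. The names split_pages, split_columns and squeeze are hypothetical, and the plot is left out.

```python
import re

# Assumed marker format, modeled on the sample output: dashes, "Page N End", dashes.
PAGE_END = re.compile(r"^-+Page \d+ End-+\s*$")


def split_pages(txt):
    """Split the converted text into pages, one list of lines per page."""
    pages, current = [], []
    for line in txt.split("\n"):
        if PAGE_END.match(line):
            pages.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # text after the last marker
        pages.append(current)
    return pages


def squeeze(text):
    """Collapse runs of four or more whitespace characters, trim the ends."""
    return re.sub(r"\s{4,}", " ", text).strip()


def split_columns(lines):
    """Split one page into (lhs, rhs) at the first column with the most spaces."""
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    counts = [sum(line[col] == " " for line in padded) for col in range(width)]
    separator_idx = counts.index(max(counts)) if counts else 0
    lhs = " ".join(line[:separator_idx] for line in padded)
    rhs = " ".join(line[separator_idx:] for line in padded)
    return squeeze(lhs), squeeze(rhs)
```

Usage would then be a loop such as: for page in split_pages(txt): lhs, rhs = split_columns(page). Pages with a single column or with more than two columns would need separate handling.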
Good luck!
easyocr package. The results are quite good. – Dyestuff