I have several PDFs that are in two-column format. Is there a way to read each PDF according to the two-column layout without cropping each PDF individually?
I found an alternative method: crop each page into two parts, left and right, then merge the left and right text for every page. You can try this:
# https://github.com/jsvine/pdfplumber
import pdfplumber

# Crop box as fractions of the page size; pdfplumber measures from the top-left corner.
x0 = 0    # Left edge of the left column, as a fraction of page width.
x1 = 0.5  # Right edge of the left column (the column split), as a fraction of page width.
y0 = 0    # Top edge of the crop, as a fraction of page height.
y1 = 1    # Bottom edge of the crop, as a fraction of page height.

all_content = []
with pdfplumber.open("file_path") as pdf:
    for i, page in enumerate(pdf.pages):
        width = float(page.width)
        height = float(page.height)

        # Crop the left half and extract its text.
        left_bbox = (x0 * width, y0 * height, x1 * width, y1 * height)
        left_text = page.crop(bbox=left_bbox).extract_text() or ""

        # Crop the right half and extract its text.
        right_bbox = (x1 * width, y0 * height, 1 * width, y1 * height)
        right_text = page.crop(bbox=right_bbox).extract_text() or ""

        # Left column first, then right column.
        page_context = '\n'.join([left_text, right_text])
        all_content.append(page_context)
        if i < 2:  # Print the first two merged pages as a check.
            print(page_context)
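If your documents have more than two columns, the crop boxes can be computed instead of written out by hand. This is only a minimal sketch: `column_bboxes` is a hypothetical helper (it is not part of pdfplumber), and it assumes the columns are equally wide and span the full page height, which will not hold for every layout.

```python
def column_bboxes(width, height, n_columns=2):
    """Return one (x0, top, x1, bottom) box per equal-width column, left to right."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    col_width = float(width) / n_columns
    return [
        (i * col_width, 0.0, (i + 1) * col_width, float(height))
        for i in range(n_columns)
    ]


# Example with a hypothetical 600 x 800 page split into two columns.
boxes = column_bboxes(600, 800)
print(boxes)
```

Each box could then be passed to page.crop(bbox=box) in the loop above, joining the extracted texts in list order.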
This is code I use for regular PDF parsing, and it seems to work OK on that image. I downloaded the page as an image, so this uses optical character recognition (OCR) and is only as accurate as regular OCR. Note that this tokenizes the text. Also note that you need to install Tesseract for this to work (pytesseract just makes Tesseract callable from Python). Tesseract is free and open source.
from PIL import Image
import pytesseract
import cv2
import os

def parse(image_path, threshold=False, blur=False):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if threshold:
        # Binarize using Otsu's automatically chosen threshold.
        gray = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    if blur:  # Useful if the background has salt-and-pepper noise.
        gray = cv2.medianBlur(gray, 3)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)  # Create a temp file.
    text = pytesseract.image_to_string(Image.open(filename))
    os.remove(filename)  # Remove the temp file.
    text = text.split()  # PROCESS HERE.
    print(text)
    return text  # Return the tokens so the caller can use them.

image_path = "page.png"  # Placeholder: path to an image of the PDF page.
a = parse(image_path, True, False)
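Because text.split() splits on any whitespace, the line breaks in the OCR output are lost. If you would rather keep the line structure, a sketch like the following splits by line first. `tokenize_lines` is a hypothetical helper, and the sample string stands in for whatever pytesseract returns.

```python
def tokenize_lines(ocr_text):
    """Split OCR output into lines, each a list of whitespace-separated tokens."""
    return [line.split() for line in ocr_text.splitlines() if line.strip()]


# Stand-in for OCR output; blank lines are skipped.
sample = "Left column line one\n\nLeft column line two\n"
lines = tokenize_lines(sample)
print(lines)
```

This could replace the text.split() step marked PROCESS HERE if per-line tokens are more useful than one flat list.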
What worked for me was a Python script named multi_column.py, which can be used as a command-line tool or imported as a module. You have to copy the code from the linked GitHub page and paste it into your working directory.
As a Command-Line Tool
python multi_column.py input.pdf footer_margin
As a Module
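import pymupdf
from multi_column import column_boxes

doc = pymupdf.open("sample.pdf")
for page in doc:
    # Detect the column rectangles on this page, ignoring the footer area.
    bboxes = column_boxes(page, footer_margin=50, no_image_text=True)
    for rect in bboxes:
        # Extract the text inside each column rectangle in reading order.
        print(page.get_text(clip=rect, sort=True))
    print("-" * 80)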
...and boom, your PDF is printed block by block!
If you want to check how well this function works, I recommend reading the source for this answer, where the author walks through a case study:
https://artifex.com/blog/extract-text-from-a-multi-column-document-using-pymupdf-inpython
Hope this works!