How to extract text from two column pdf with Python?

Asked 11/3, 2019 at 10:43 Answered 28/5 at 22:15

I have PDFs in a two-column format. Is there a way to read each PDF according to the two-column layout without cropping each PDF individually?

Isolationism answered 11/3, 2019 at 10:43 Comment(1)

What are your results so far? Apparently the pdf is in text format (NLP), not image (OCR). – Kristakristal 11/3, 2019 at 10:49

I found an alternative method: crop each page into two parts, left and right, then merge the left content and the right content for every page. You can try this:

# https://github.com/jsvine/pdfplumber

import pdfplumber


# Crop boundaries as fractions of the page size. pdfplumber bounding boxes are
# (x0, top, x1, bottom), with top and bottom measured from the top of the page.
x0 = 0    # Left edge of the left column.
x1 = 0.5  # Right edge of the left column (the split between the columns).
y0 = 0    # Top edge of the crop.
y1 = 1    # Bottom edge of the crop.

all_content = []
with pdfplumber.open("file_path") as pdf:
    for i, page in enumerate(pdf.pages):
        width = page.width
        height = page.height

        # Crop the left half of the page and extract its text
        left_bbox = (x0*float(width), y0*float(height), x1*float(width), y1*float(height))
        left_text = page.crop(bbox=left_bbox).extract_text() or ""

        # Crop the right half of the page and extract its text
        right_bbox = (x1*float(width), y0*float(height), 1*float(width), y1*float(height))
        right_text = page.crop(bbox=right_bbox).extract_text() or ""

        # Merge the columns: left first, then right
        page_context = '\n'.join([left_text, right_text])
        all_content.append(page_context)
        if i < 2:  # print the merged text of the first two pages as a check
            print(page_context)
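
If your documents have more than two columns, the same cropping idea can be generalized. Below is a minimal sketch, assuming equal-width columns that span the full page height; `column_bboxes` is a hypothetical helper name, not part of pdfplumber. It only computes the boxes in pdfplumber's (x0, top, x1, bottom) order.

```python
def column_bboxes(width, height, n_columns=2):
    """Split a page into n_columns equal-width vertical strips.

    Returns a list of (x0, top, x1, bottom) tuples, left to right.
    Assumes equal-width columns covering the full page height.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # Column edges from left to right; the last edge is exactly the page
    # width, so rounding can never push a box outside the page.
    edges = [width * i / n_columns for i in range(n_columns)] + [width]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_columns)]
```

Each box can then be passed to page.crop(bbox=...) in turn, and the extracted texts joined in left-to-right order.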

Chesterton answered 24/9, 2021 at 14:20 Comment(3)

Some pages may or may not have text split into columns. How can I write an if statement based on this? @Chesterton – Yankeeism 30/11, 2021 at 9:40

@Yankeeism Do you mean the statement "if i < 2"? It is used for seeing the merged first two pages. Or you can provide more info🤗 – Chesterton 1/12, 2021 at 4:16

No, but that's useful to note. I'm talking about when to apply column extraction; conditionally. I've made a post about it here: https://mcmap.net/q/1329403/-pdfplumber-extract-text-from-dynamic-column-layouts/16105404 – Yankeeism 1/12, 2021 at 9:6
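
One possible way to decide per page whether to apply column extraction is to check if any words cross the vertical centre of the page. This is only a heuristic sketch, not something from the answer above: `looks_two_column` and its thresholds are hypothetical, and it assumes two columns of roughly equal width with the gutter at the centre of the page.

```python
def looks_two_column(word_spans, page_width, gutter_frac=0.02,
                     max_crossing_frac=0.02, min_words_per_side=5):
    """Guess whether a page is laid out in two columns.

    word_spans: iterable of (x0, x1) horizontal extents, one per word.
    Returns True when almost no words overlap a narrow strip around the
    page centre and both halves contain at least min_words_per_side words.
    """
    mid = page_width / 2
    half_gutter = page_width * gutter_frac / 2
    lo, hi = mid - half_gutter, mid + half_gutter

    left = right = crossing = 0
    for x0, x1 in word_spans:
        if x0 < hi and x1 > lo:   # word overlaps the centre strip
            crossing += 1
        elif x1 <= lo:            # word lies entirely left of the strip
            left += 1
        else:                     # word lies entirely right of the strip
            right += 1

    total = left + right + crossing
    if total == 0:
        return False              # empty page: nothing to split
    return (crossing / total <= max_crossing_frac
            and left >= min_words_per_side
            and right >= min_words_per_side)
```

With pdfplumber, the spans could come from page.extract_words(), which returns dicts with "x0" and "x1" keys, e.g. spans = [(w["x0"], w["x1"]) for w in page.extract_words()]. The small tolerance for crossing words is there so that a full-width title on an otherwise two-column page does not flip the result; the thresholds will likely need tuning for your documents.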

This is the code I use for regular PDF parsing, and it seems to work OK on that image (I downloaded an image, so this uses optical character recognition and is only as accurate as regular OCR). Note that this tokenizes the text. Also note that you need to install Tesseract for this to work (pytesseract is just a Python wrapper around Tesseract). Tesseract is free and open source.

from PIL import Image
import pytesseract
import cv2
import os

def parse(image_path, threshold=False, blur=False):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if threshold:
        gray = cv2.threshold(gray, 0, 255, \
            cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    if blur: #useful if salt-and-pepper background.
        gray = cv2.medianBlur(gray, 3)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray) #Create a temp file
    text = pytesseract.image_to_string(Image.open(filename))
    os.remove(filename) #Remove the temp file
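    tokens = text.split()  # Tokenize on whitespace. PROCESS HERE.
    print(tokens)
    return tokens

image_path = "page.png"  # placeholder: path to an image of the PDF page
a = parse(image_path, True, False)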

George answered 11/3, 2019 at 16:0 Comment(2)

Also, I may have borrowed that code from someone else a while back; I don't actually recall whether that specific snippet is mine or someone else's. – George 11/3, 2019 at 16:1

Worked better without "if"s for me. – Ulda 11/1, 2021 at 20:34

What worked for me was using a Python script named multi_column.py, which can be used as a command-line tool or imported as a module. You have to copy the code from the linked GitHub page and paste it into your working directory.

As a Command-Line Tool

python multi_column.py input.pdf footer_margin

As a Module

import pymupdf
from multi_column import column_boxes

doc = pymupdf.open("sample.pdf")
for page in doc:
    bboxes = column_boxes(page, footer_margin=50, no_image_text=True)
    for rect in bboxes:
        print(page.get_text(clip=rect, sort=True))
    print("-" * 80)

...and boom, your PDF is printed as blocks of text!

If you want to check how effective this function is, I recommend reading the source for this answer, where the author walks through a case study:

https://artifex.com/blog/extract-text-from-a-multi-column-document-using-pymupdf-inpython

Hope this works!

Onder answered 28/5 at 22:15 Comment(0)
