Detect and crop a box in .pdf or image as individual images

Asked 17/7, 2019 at 4:30 Answered 22/11, 2021 at 11:22

python opencv image-processing computer-vision pypdf

I have a multi-page .pdf (scanned images) containing handwriting I would like to crop and store as new separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I automatically do this for a large, multi-page .pdf using python?

I tried using the PyPDF2 package to crop one of the handwriting boxes based on (x,y) coordinates, but this approach doesn't work for me because the boundaries/coordinates of the handwriting boxes won't always be the same on each page of the PDF. I believe detecting the boxes would be a better approach for auto-cropping. Not sure if it's useful, but below is the code I used for the (x,y) coordinate approach:

from PyPDF2 import PdfFileReader, PdfFileWriter

# The second positional argument of PdfFileReader is `strict`, not a file mode, so "r" is dropped
reader = PdfFileReader("data/samples.pdf")

# getting the first page
page = reader.getPage(0)

writer = PdfFileWriter()

# Loop through all pages in pdf object to crop based on (x,y) coordinates
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    page.cropBox.setLowerLeft((42, 115))
    page.cropBox.setUpperRight((500, 245))
    writer.addPage(page)

with open("samples_cropped.pdf", "wb") as fp:
    writer.write(fp)

Thank you in advance for your help

Arsenious answered 17/7, 2019 at 4:30 Comment(6)

PDF is a vector format. It has no size in pixels until it is rasterized by providing a density when it is read. So you need to either first rasterize it or if it has an embedded image, then extract it with something like pdfimages. Once you have that done, you can use OpenCV or Imagemagick to find contours or blobs and then use connected components to find the bounding boxes of the rectangles. Then you can crop those regions. – Broomcorn 17/7, 2019 at 4:34

@Broomcorn thanks for sharing. I'm not an expert on this, so excuse my novice questions. So step 1 would simply be to convert all the pages in the PDF to an image format, like JPEG? It might be helpful to add that this PDF document is generated from a scanner feeder, so I'm not sure about "embedded image[s]" – Arsenious 17/7, 2019 at 4:55

If it was scanned, then it is likely a raster image embedded in a vector PDF shell. So the best method would be to use pdfimages to extract the raster image, preferably as PNG or TIFF rather than JPG, since JPG is a lossy compression format. See linux.die.net/man/1/pdfimages and cyberciti.biz/faq/easily-extract-images-from-pdf-file – Broomcorn 17/7, 2019 at 5:7
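
A minimal sketch of how the pdfimages suggestion could be driven from Python with subprocess. It assumes the Poppler build of pdfimages is installed and on the PATH; the helper names, the input path, and the output prefix are hypothetical and only for illustration. The -png option asks Poppler's pdfimages to write PNG files.

```python
import subprocess


def build_pdfimages_command(pdf_path, output_prefix):
    """Return the argument list for Poppler's pdfimages (hypothetical helper)."""
    # -png writes the embedded images as PNG, a lossless format.
    return ["pdfimages", "-png", str(pdf_path), str(output_prefix)]


def extract_embedded_images(pdf_path, output_prefix):
    """Run pdfimages; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_pdfimages_command(pdf_path, output_prefix), check=True)


# Assumed usage (the "pages" directory must already exist):
# extract_embedded_images("data/samples.pdf", "pages/page")
```

pdfimages numbers its output files after the given prefix, so each extracted page image ends up as a separate file that can be passed to OpenCV.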

@Broomcorn I am trying to follow the pdfimages Linux instructions you shared: cyberciti.biz/faq/easily-extract-images-from-pdf-file . I looked into it and learned that I need to use the subprocess package to execute Linux commands in my Python code. But how do I correctly install poppler-utils (the instructions use apt-get install poppler-utils)? Do I need to use Homebrew? In case it's relevant, I am on macOS, using a conda virtual environment, coding in Python in JupyterLab. – Arsenious 17/7, 2019 at 21:24

Best to check Homebrew for that package. I am on a Mac and did it from MacPorts. – Broomcorn 17/7, 2019 at 21:33

Why not utilize contour detection and filtering methods such as contour area or aspect ratio to extract the ROI of the box? – Umbria 17/7, 2019 at 21:53

Here's a simple approach using OpenCV

Convert image to grayscale and Gaussian blur
Threshold image
Find contours
Iterate through contours and filter using contour area
Extract ROI

After extracting the ROI, you can save each as a separate image and then perform OCR text extraction using pytesseract or some other tool.

You mention this

The boundaries/coordinates of the handwriting boxes wont always be the same for each page in the pdf.

Your current approach of using fixed (x,y) coordinates isn't very robust, since the boxes could be anywhere on the image. A better approach is to detect the boxes by filtering contours with a minimum contour area threshold. Depending on how small or large the boxes you want to detect are, you can adjust this value (min_area in the code below). If you want additional filtering to prevent false positives, you can add aspect ratio as another filter. For instance, calculate the aspect ratio (width / height of the bounding rectangle) of each contour and accept it as a valid box only if it falls within chosen bounds (say 0.8 to 1.2 for a roughly square ROI; a wide rectangle needs a correspondingly higher range).

import cv2

image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.threshold(blurred, 230,255,cv2.THRESH_BINARY_INV)[1]

# Find contours
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Iterate through contours and filter for ROI
image_number = 0
min_area = 10000
for c in cnts:
    area = cv2.contourArea(c)
    if area > min_area:
        x,y,w,h = cv2.boundingRect(c)
        cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 2)
        ROI = original[y:y+h, x:x+w]
        cv2.imwrite("ROI_{}.png".format(image_number), ROI)
        image_number += 1

cv2.imshow('image', image)
cv2.waitKey(0)
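
The aspect-ratio filter mentioned above is not part of this code. A minimal sketch of how it might look follows; the function name and the default bounds are assumptions for illustration, with min_area matching the value used above and the 0.8 to 1.2 range taken from the text.

```python
def is_candidate_box(area, w, h, min_area=10000, min_ratio=0.8, max_ratio=1.2):
    """Return True if a contour passes both the area and aspect-ratio filters.

    area is the contour area (e.g. from cv2.contourArea) and w, h are the
    width and height of its bounding rectangle (e.g. from cv2.boundingRect).
    """
    if w <= 0 or h <= 0:
        return False
    aspect_ratio = w / float(h)
    return area > min_area and min_ratio <= aspect_ratio <= max_ratio


# Assumed usage inside the contour loop:
# x, y, w, h = cv2.boundingRect(c)
# if is_candidate_box(cv2.contourArea(c), w, h, min_ratio=2.0, max_ratio=6.0):
#     ...crop and save the ROI...
```

For wide handwriting boxes the default bounds would reject everything, so the ratio range should be chosen from a few sample pages.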

Umbria answered 17/7, 2019 at 21:38 Comment(6)

Thanks for sharing. I have a couple initial questions 1.) How would you recommend i convert/extract my multi-page pdf to images? This would be the first step before converting image(s) to grayscale and Gaussian blur. 2.) Your solution is for processing one image, how can this be modified for processing multiple images at once, hence all the pages in the pdf file? – Arsenious 17/7, 2019 at 22:15

@Steve If your PDFs are raster files in a PDF vector shell, then you can use pdfimages to extract the raster files. From Python, you would need to use subprocess calls, since pdfimages is not Python based. Alternatively, you can use ImageMagick or Python-based tools to rasterize your PDF into multiple images. With regard to processing multiple files, you can simply take the code from nathancy and add a loop over each raster file that was extracted or converted from the PDF. – Broomcorn 17/7, 2019 at 22:19

1.) That's for you to decide, you could use some external Python library to do that. I personally would use a .pdf to image converter like smallpdf. 2.) After converting the .pdf to say a .png image, you would have an image for each page. All this would be in a directory. You could then iterate through each image and apply this solution – Umbria 17/7, 2019 at 22:20

@Umbria i will test this approach once i finalize the pdf to image conversion and get back to you. Once again, thanks for sharing this information. It is very helpful. – Arsenious 17/7, 2019 at 23:37

@Umbria why threshold 230 - 255 ? – Laky 8/12, 2020 at 11:55

@Laky just an arbitrary threshold value. It's actually probably better to let OpenCV automatically determine the value by using Otsu's or Adaptive thresholding – Umbria 17/2, 2021 at 1:0

-2

Crop a PDF page or image into individual images using a predefined bounding box

Using OpenCV to detect and crop the box would be overkill for small projects. For smaller projects where the bounding area is known in advance, the following code works well and also saves the image at the same resolution as the original PDF page or image.

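from PIL import Image

def ImageCrop():
    img = Image.open("page_1.jpg")
    # Crop box in pixels, measured from the top-left corner of the image
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # outfile4 was never defined in the original snippet; this file name is a placeholder
    outfile4 = "page_1_cropped.jpg"
    # save() opens and writes the file itself, so no separate open() call is needed
    img_res.save(outfile4, 'JPEG')

ImageCrop()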

Unclear answered 22/11, 2021 at 11:22 Comment(0)
