I have a multi-page PDF of scanned images containing handwriting that I would like to crop and store as separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I do this automatically for a large, multi-page PDF using Python?
I tried using the PyPDF2 package to crop one of the handwriting boxes based on (x, y) coordinates. However, this approach doesn't work for me because the boundaries/coordinates of the handwriting boxes won't always be the same on each page of the PDF. I believe detecting the boxes would be a better approach for auto-cropping. Not sure if it's useful, but below is the code I used for the (x, y) coordinate approach:
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("data/samples.pdf")
writer = PdfFileWriter()

# Loop through all pages in the PDF and crop each one to the same (x, y) box
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    page.cropBox.setLowerLeft((42, 115))
    page.cropBox.setUpperRight((500, 245))
    writer.addPage(page)

# Write the cropped pages to a new PDF
with open("samples_cropped.pdf", "wb") as fp:
    writer.write(fp)
Thank you in advance for your help.
apt-get install poppler-utils? Do I need to use Homebrew? In case it's relevant, I am on macOS, using a conda virtual environment, coding in Python in JupyterLab. – Arsenious