Camelot PDF dimensions

I searched Stack Overflow extensively before posting this and could not find anything on Camelot page dimensions. There is this question, which suggests using table_regions, but that does not solve OP's problem or mine. Unfortunately, I cannot comment to follow up with OP and see if they found a solution.

What I am trying to do:

I am using Camelot to identify tables (obviously). Sometimes, when I know the region of the page that might contain a table of interest, I want to search only that region. This is easily done with the table_regions kwarg of camelot.read_pdf(): I just need to provide the coordinates of two opposite corners of the region for Camelot to search.

The issue is that I get these coordinates from PyMuPDF, so they are in PyMuPDF's coordinate system. I have figured out how to translate them, but I am missing one key piece of information from Camelot: the dimensions of the page. These values are easy to get in PyMuPDF (the Page class's .bound() method), and I need the Camelot equivalent. I can provide a further explanation of the algebra here if anyone thinks maybe there is alternative between

What I have tried so far:

I read the documentation. Because of this line in the documentation, I am wondering if this might provide a way to get the dimensions: "There might be cases while using Lattice when smaller lines don’t get detected. The size of the smallest line that gets detected is calculated by dividing the PDF page’s dimensions with a scaling factor called line_scale. By default, its value is 15"

I am open to alternatives. Essentially, I want either to check whether a given region of the page contains a table, or to check whether any table on the page falls within a given region, if that makes sense. The region is described in PyMuPDF's coordinate system, in which a PDF page's dimensions are typically (612, 792) with the origin in the top-left corner. Camelot's origin is in the bottom-left corner.
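The translation between the two coordinate systems can be sketched as follows. This is a minimal sketch, assuming an unrotated page whose MediaBox origin is (0, 0) and that both libraries measure in PDF points. The function name pymupdf_rect_to_camelot_region is hypothetical and is not part of either library.

```python
def pymupdf_rect_to_camelot_region(rect, page_height):
    """Convert a top-left-origin rectangle to a Camelot region string.

    rect is (x0, y0, x1, y1) with y growing downward (PyMuPDF style).
    Camelot expects "x1,y1,x2,y2" with (x1, y1) the left-top corner and
    (x2, y2) the right-bottom corner, with y growing upward.
    """
    x0, y0, x1, y1 = rect
    top = page_height - y0      # flip the top edge
    bottom = page_height - y1   # flip the bottom edge
    return f"{x0},{top},{x1},{bottom}"


# Hypothetical example: a region on a 612 x 792 (US Letter) page.
region = pymupdf_rect_to_camelot_region((50, 100, 300, 400), page_height=792)
print(region)
```

The resulting string can then be passed in a list, for example camelot.read_pdf(path, table_regions=[region]), with the page height taken from PyMuPDF (page.rect.height).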

Fennie answered 3/12, 2019 at 19:19 Comment(3)
In case anyone has a similar issue, I have discovered that Camelot uses OpenCV's coordinate system, and the shape property gives the x and y dimensions. - Fennie
Would you be able to clarify how you got page dimensions out of the shape property? - Burberry
@Burberry Sure. Convert the PDF page into an image (there are several methods), and either load it into OpenCV (cv2.imread) or just convert it to an np.array. Then img.shape[1] is the width and img.shape[0] is the height. - Fennie
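Note that the shape of a rendered image is in pixels and depends on the rendering resolution, so pixel coordinates need scaling before they can be compared with PDF points. A minimal sketch, assuming the image covers the whole page and rows are counted from the top, as in OpenCV. The helper pixel_to_pdf_point is hypothetical and is not a Camelot or OpenCV function.

```python
def pixel_to_pdf_point(px, py, img_shape, pdf_width, pdf_height):
    """Map an image pixel (px, py) to PDF coordinates (bottom-left origin)."""
    img_height, img_width = img_shape[0], img_shape[1]  # shape is (rows, cols, ...)
    x = px * pdf_width / img_width
    y = pdf_height - py * pdf_height / img_height  # flip the y axis
    return x, y


# Hypothetical example: a 612 x 792 pt page rendered at 1224 x 1584 pixels.
shape = (1584, 1224, 3)
point = pixel_to_pdf_point(612, 396, shape, 612, 792)
```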

Try the following code to see if it gives you the dimensions you want:

from camelot import utils
# file_name is the path to the PDF; dim is a (width, height) tuple
layout, dim = utils.get_page_layout(file_name)
Po answered 4/12, 2019 at 14:58 Comment(2)
Yes, this works, actually. Looking at this, either layout or dim will provide the scaling. layout.width and layout.height should equal dim[0] and dim[1] respectively. I am not sure why their source code sets up dim the way it does, as I have never used the bbox attribute to get the width/height when using pdfminer, but now that I am more familiar with PDFs, this is likely how I would approach the original question. - Fennie
How can I get the dimensions of a particular page in a PDF with this method? - Exieexigency
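For a specific page, one option is to read that page's MediaBox with pdfminer directly rather than going through Camelot. This is a sketch, assuming pdfminer.six is installed and that the page is not rotated or cropped. The names page_dimensions and dims_from_mediabox are hypothetical helpers, not Camelot functions.

```python
def dims_from_mediabox(mediabox):
    """Return (width, height) from a [x0, y0, x1, y1] MediaBox."""
    x0, y0, x1, y1 = mediabox
    return (x1 - x0, y1 - y0)


def page_dimensions(path, page_index):
    """Return (width, height) in PDF points for a zero-based page index."""
    from pdfminer.pdfpage import PDFPage  # imported lazily; requires pdfminer.six

    with open(path, "rb") as f:
        for i, page in enumerate(PDFPage.get_pages(f)):
            if i == page_index:
                return dims_from_mediabox(page.mediabox)
    raise IndexError(f"page {page_index} not found in {path}")
```

With these helpers, page_dimensions(path, 2) would return the size of the third page.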
