Python Camelot borderless table extraction issue

Asked 8/11, 2018 at 14:3 Answered 9/2, 2021 at 11:50

I'm trying to extract borderless tables from PDF files, such as the one shown in the image below. I installed python-camelot as shown here, and it works fine, but only for bordered tables. Details below:

platform - Linux-4.5.5-300.fc24.x86_64-x86_64-with-fedora-24-Twenty_Four

sys - Python 3.6.1 (default, May 15 2017, 11:42:04)[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)]

numpy - NumPy 1.15.4

cv2 - OpenCV 3.4.3

camelot - Camelot 0.3.2

Splint answered 8/11, 2018 at 14:3 Comment(2)

Can you post the code that you used to extract tables from this PDF using Camelot? – Alfons 19/11, 2018 at 19:57

@VinayakMehta The code is just the example given at the link below [github.com/socialcopsdev/camelot]. I have tried all the optional parameters, but none seems to work. – Splint 12/12, 2018 at 10:16

To improve the detected area, you can increase the edge_tol (default: 50) value to counter the effect of text being placed relatively far apart vertically. Larger edge_tol will lead to longer textedges being detected, leading to an improved guess of the table area. Let’s use a value of 500.

>>> import camelot
>>> import matplotlib.pyplot as plt
>>> tables = camelot.read_pdf('edge_tol.pdf', flavor='stream', edge_tol=500)
>>> camelot.plot(tables[0], kind='contour')
>>> plt.show()
>>> tables[0].df

Conglobate answered 1/8, 2019 at 4:11 Comment(1)

readthedocs.org/projects/camelot-py/downloads/pdf/master genius! just putting up the link to the info in the pdf docs – Rounds 7/12, 2019 at 0:43

Camelot uses the lattice flavor by default, which relies on clear lines dividing the cells.

For tables without lines, use the stream flavor instead:

import camelot
tables = camelot.read_pdf('your_file_name.pdf', flavor='stream')
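
If you don't know in advance whether a page has ruled or borderless tables, one option is to try lattice first and fall back to stream when nothing is found. The following is only a sketch of that idea: `read_with_fallback` and its `reader` argument are hypothetical names introduced here, not part of Camelot, and lattice can occasionally return a spurious table instead of an empty list, so check the result.

```python
def read_with_fallback(path, pages="1", reader=None):
    """Try the lattice flavor first; use stream if lattice finds no tables.

    `reader` is a hypothetical hook that defaults to camelot.read_pdf.
    """
    if reader is None:
        # Imported lazily so the helper can be exercised without Camelot installed.
        import camelot
        reader = camelot.read_pdf

    # Lattice needs visible ruling lines between cells.
    tables = reader(path, pages=pages, flavor="lattice")
    if len(tables) == 0:
        # Stream groups text by whitespace, so it suits borderless tables.
        tables = reader(path, pages=pages, flavor="stream")
    return tables
```

Usage would be `tables = read_with_fallback('your_file_name.pdf')`, followed by `tables[0].df` as usual.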

Griffis answered 6/3, 2019 at 16:13 Comment(2)

This did not work for some of the tables which have hidden borders – Moores 26/5, 2020 at 8:4

You have to pass table and column dimensions. @allahbaksh – Abecedary 2/4, 2022 at 9:1

Another solution that might help is setting table_areas explicitly, e.g. to the size of the page:

# A4 portrait, MediaBox[0 0 595 842]
tables = camelot.read_pdf("filename.pdf", table_areas=["0,842,595,0"])

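You can find the size of the area either through Camelot’s visual debugging features, or by opening the PDF with a text editor and checking the MediaBox or CropBox dimensions (beware that they don’t use the same coordinate convention: MediaBox lists the lower-left corner first, then the upper-right, while table_areas expects top-left, then bottom-right).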
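
To avoid mixing up the two conventions, the conversion can be wrapped in a small helper. This is a minimal sketch: `mediabox_to_table_area` and `read_full_page` are hypothetical names introduced here, and passing flavor='stream' is my assumption for the borderless case in the question.

```python
def mediabox_to_table_area(mediabox):
    """Turn a MediaBox [llx, lly, urx, ury] into a table_areas string."""
    llx, lly, urx, ury = mediabox
    # MediaBox: lower-left corner, then upper-right corner.
    # table_areas: top-left corner, then bottom-right corner.
    return f"{llx:g},{ury:g},{urx:g},{lly:g}"


def read_full_page(path, mediabox, flavor="stream"):
    """Hypothetical wrapper: run Camelot over the whole page area."""
    # Imported lazily so the conversion helper works without Camelot installed.
    import camelot
    area = mediabox_to_table_area(mediabox)
    return camelot.read_pdf(path, flavor=flavor, table_areas=[area])


# A4 portrait, MediaBox[0 0 595 842]
print(mediabox_to_table_area([0, 0, 595, 842]))  # 0,842,595,0
```

The same helper works for a CropBox, since it uses the same corner order as MediaBox.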

Cubiform answered 9/2, 2021 at 11:50 Comment(0)
