tabula vs camelot for table extraction from PDF
I need to extract tables from PDFs. The tables can be of any type: multiple headers, vertical headers, horizontal headers, etc.

I have implemented the basic use case with both libraries and found that Tabula does a bit better than Camelot, but it still does not detect all tables perfectly, and I am not sure whether it will work for all kinds of tables.

So I am seeking suggestions from experts who have implemented a similar use case.

Example PDFs: PDF1 PDF2 PDF3

Tabula Implementation:

import tabula

# read_pdf returns a list of DataFrames, one per detected table
tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")

Camelot Implementation:

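import camelot

# split_text=True splits text that spans multiple cells
tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
for tabs in tables:
    # each table exposes its contents as a DataFrame via .df
    print(tabs.df, "\n=================================\n")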
Escobar answered 23/4, 2020 at 12:32 Comment(2)
"still not able to detect all tables perfectly" - it is extremely unlikely that there will ever be software that detects all tables perfectly. (Undershrub)
@Niranjan Kumar: Try SLICEmyPDF, mentioned in one of the answers at #56018202. (Roth)
10

Please read this: https://camelot-py.readthedocs.io/en/master/#why-camelot

The main advantage of Camelot is that it exposes many parameters you can tune to improve the extraction.

Of course, applying these parameters takes some study and a few attempts.

Here you can find a comparison of Camelot with other PDF table extraction libraries.
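As a rough illustration of that trial-and-error process, here is a minimal sketch that tries a few Camelot settings and keeps the one with the highest average `accuracy` from each table's `parsing_report`. The candidate values (`line_scale=40`, `row_tol=10`) and the helper names `mean_accuracy` and `best_extraction` are my own assumptions for illustration, not recommendations from the Camelot docs. Accuracy scores from the `lattice` and `stream` flavors are not strictly comparable, so treat the result as a hint and still inspect the output.

```python
# Candidate keyword arguments for camelot.read_pdf. The values are
# illustrative guesses; tune them for your own PDFs.
CANDIDATES = [
    {"flavor": "lattice"},                    # tables with ruling lines
    {"flavor": "lattice", "line_scale": 40},  # detect shorter lines
    {"flavor": "stream"},                     # tables separated by whitespace
    {"flavor": "stream", "row_tol": 10},      # merge text that is close vertically
]


def mean_accuracy(tables):
    """Average the 'accuracy' field of each table's parsing_report."""
    scores = [t.parsing_report["accuracy"] for t in tables]
    return sum(scores) / len(scores) if scores else 0.0


def best_extraction(path, pages="all", candidates=CANDIDATES, reader=None):
    """Try each candidate and return (score, kwargs, tables) for the best one.

    `reader` defaults to camelot.read_pdf; it is a parameter only so the
    function can be exercised without Camelot installed.
    """
    if reader is None:
        import camelot  # imported lazily so the module loads without it
        reader = camelot.read_pdf

    best = None
    for kwargs in candidates:
        tables = reader(path, pages=pages, **kwargs)
        score = mean_accuracy(tables)
        if best is None or score > best[0]:
            best = (score, kwargs, tables)
    return best
```

Usage would look something like `score, kwargs, tables = best_extraction('pdfs/PDF1.pdf')`, followed by a look at `tables[0].df` to confirm that the winning settings really produce sensible rows.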

Celeriac answered 24/4, 2020 at 9:23 Comment(0)
2

I think Camelot extracts data in a cleaner format that is not jumbled up (i.e. the data retains its information and row contents are not affected). So the quality of the extracted data is better when cells differ in the number of lines they contain. Also note that Tabula requires a Java Runtime Environment.
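Because tabula-py wraps the Java-based Tabula, it can help to check for a Java runtime before calling it. The following is only a sketch using the standard library; the helper names `java_available` and `java_version_banner` are hypothetical, and it assumes that `java` must be on `PATH`.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def java_version_banner():
    """Return the text printed by `java -version`, or None if unavailable."""
    if not java_available():
        return None
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    )
    # `java -version` conventionally writes to stderr, not stdout
    return (result.stderr or result.stdout).strip() or None


if __name__ == "__main__":
    if java_available():
        print("Java found:", java_version_banner())
    else:
        print("No Java runtime on PATH; tabula-py will not work, Camelot still can.")
```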

There are both open (Tabula, pdf-table-extract) and closed-source (smallpdf, PDFTables) tools that are widely used to extract tables from PDF files. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table. Camelot was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!

Liege answered 29/11, 2021 at 11:46 Comment(0)
