I am trying to extract tables from a multi-page PDF file using camelot-py v0.7.3.
So far it has been the best PDF reader tool for me. I only need to read the PDF line by line and detect the tables manually. I tried many other tools such as tabula, PyPDF2/4, pdfminer, etc. Some of them could not detect the text itself properly, and some of them disturbed the word order or the spacing between the columns.
camelot-py, however, gave me the data in the format best suited to my application.
When extracting data from the PDF, camelot-py detects almost all of the table data correctly, except for two problems:
It groups multiple tables together in the same 'TableList' element. I am able to separate these grouped tables, so this is not a concern.
The last table from each group is repeated in a separate 'TableList' element. This repetition is my main concern.
The code used for the above process is as follows:
import camelot

tables = camelot.read_pdf('test.pdf', pages='1-end', flavor='stream')
tables.export('foo.csv', f='csv', compress=False)
for table in tables:
    table_df = table.df
    # Code to parse the data from each table's DataFrame goes here
Why is camelot-py repeating some tables? Is there any way to handle this repetition?
More info:
Input PDF file: I can't share the PDF files because they contain sensitive data, but here are some details that should give a good idea of the structure. All pages contain only tables.
Page 1: Contains Table 1, which holds the customer's info, and Tables 2 to 4, which share the same structure
Page 2: Contains the remaining rows of Table 4, and Tables 5 to 7 with the same structure as Table 2
Page 3: Contains Tables 8 to 10 with the same structure as Table 2
Output CSV files:
foo-page-1-table-1: Contains Table 1
foo-page-1-table-2: Contains the last row of Table 1 (repeated) and Tables 2 to 4
foo-page-2-table-1: Contains Table 7 (repeated, with its first row missing)
foo-page-2-table-2: Contains the remaining rows of Table 4 and Tables 5 to 7
foo-page-3-table-1: Contains Table 10 (repeated in full)
foo-page-3-table-2: Contains Tables 8 to 10
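One workaround I am considering is to drop any table whose rows all appear inside a larger table on the same page, which matches the page 2 and page 3 output above. This is only a minimal sketch: the helper names (normalize, is_repeat, drop_repeats) are my own and not part of camelot-py, the rows are assumed to come from something like table.df.values.tolist(), and it assumes the repeated table is parsed with the same column split as the larger one, which the stream flavor may not guarantee.

```python
def normalize(rows):
    """Turn rows of cells into a list of whitespace-trimmed string tuples."""
    return [tuple(str(cell).strip() for cell in row) for row in rows]


def is_repeat(candidate, other):
    """True if every non-empty row of `candidate` also appears in `other`."""
    other_rows = set(normalize(other))
    cand_rows = [row for row in normalize(candidate) if any(row)]
    return bool(cand_rows) and all(row in other_rows for row in cand_rows)


def drop_repeats(page_tables):
    """Keep the tables of one page that are not contained in a larger table."""
    kept = []
    for i, rows in enumerate(page_tables):
        repeated = any(
            i != j and len(other) > len(rows) and is_repeat(rows, other)
            for j, other in enumerate(page_tables)
        )
        if not repeated:
            kept.append(rows)
    return kept


# Toy data standing in for the two elements detected on page 2
# (hypothetical values; the real rows would come from each table's DataFrame).
page_2 = [
    [["B2", "2"], ["B3", "3"]],                            # repeat, first row missing
    [["A9", "9"], ["B1", "1"], ["B2", "2"], ["B3", "3"]],  # grouped tables
]
print(len(drop_repeats(page_2)))  # prints 1
```

Page 1 would be left untouched by this check, because Table 1 is not fully contained in the second element; only its last row is repeated there, so that row would still need separate handling.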