Same table is extracted twice from a pdf by Camelot-py

I am trying to extract tables from a multiple page PDF file using camelot-py v0.7.3.

So far it has been the best PDF reading tool for my purpose: I only need to read the PDF line by line and detect the tables myself. I tried many other tools, such as tabula, PyPDF2/4 and pdfminer. Some of them could not detect the text properly, and others disturbed the word order or the spacing between columns.

But camelot-py gave me the data in the format which is best suited for my application.

When extracting data from the PDF, camelot-py detects the data of all tables well, apart from two problems:

  1. It groups multiple tables together in the same 'TableList' element. I am able to separate these grouped tables afterwards, so this is not a concern.

  2. The last table of each group is repeated in a separate 'TableList' element. This repetition is my main concern.

The code I use for this is:

import camelot

# Read every page with the 'stream' flavor (whitespace-based table detection)
tables = camelot.read_pdf('test.pdf', pages='1-end', flavor='stream')
# Writes one CSV per detected table: foo-page-<n>-table-<m>.csv
tables.export('foo.csv', f='csv', compress=False)

for table in tables:
    table_df = table.df
    # Code to parse the data of each element, converted into a DataFrame

Why is camelot-py repeating some tables? Is there any way to handle this repetition?

More info:

Input PDF file: I can't share the PDF files because they contain sensitive data, but here are some details that should give a good idea of the structure. All pages contain only tables.

Page 1: Contains Table 1, which holds the customer's info, and Tables 2 to 4, which all share the same structure.

Page 2: Contains some rows from Table 4, and Tables 5 to 7 with the same structure as Table 2.

Page 3: Contains Tables 8 to 10 with the same structure as Table 2.

Output CSV files:

foo-page-1-table-1: Contains Table 1

foo-page-1-table-2: Contains the last row of Table 1 (repeated) and Tables 2 to 4

foo-page-2-table-1: Contains Table 7 (repeated, with the first row missing)

foo-page-2-table-2: Contains some end rows from Table 4 and Tables 5 to 7

foo-page-3-table-1: Contains Table 10 (repeated fully)

foo-page-3-table-2: Contains Tables 8 to 10
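
Given the output pattern listed above, one possible workaround (a rough sketch, not a fix for the underlying cause) would be to drop any extracted table whose rows all appear in a larger table from the same page. The helper names `normalize_rows` and `drop_contained_tables` are hypothetical, not part of camelot. The sketch assumes that comparing rows by their non-empty, stripped cells is good enough, since the stream flavor may split columns differently in the two copies. It would not handle the partial overlap seen in foo-page-1-table-2, where only a single row is repeated.

```python
def normalize_rows(rows):
    """Turn each row into a tuple of its non-empty, stripped cells.

    Assumption: ignoring empty cells makes rows comparable even when the
    two copies of a table were split into different numbers of columns.
    """
    normalized = []
    for row in rows:
        cells = tuple(str(c).strip() for c in row if str(c).strip())
        if cells:  # skip rows that are entirely empty
            normalized.append(cells)
    return normalized


def drop_contained_tables(tables):
    """Return the indices of the tables worth keeping.

    `tables` is a list of (page, rows) pairs, where `rows` is a list of
    lists of cell values. A table is dropped when every one of its rows
    also appears in another table on the same page that is larger, or
    that has the same number of rows and comes earlier in the list.
    """
    normed = [(page, normalize_rows(rows)) for page, rows in tables]
    keep = []
    for i, (page_i, rows_i) in enumerate(normed):
        contained = False
        for j, (page_j, rows_j) in enumerate(normed):
            if i == j or page_i != page_j or not rows_i:
                continue
            larger = len(rows_j) > len(rows_i)
            same_size_earlier = len(rows_j) == len(rows_i) and j < i
            if (larger or same_size_earlier) and set(rows_i) <= set(rows_j):
                contained = True
                break
        if not contained:
            keep.append(i)
    return keep
```

With camelot, the input could probably be built as `[(t.page, t.df.values.tolist()) for t in tables]`, assuming the `page` attribute is populated on each table as expected; the returned indices then select the tables to parse further.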

Osteal answered 21/2, 2020 at 18:12

Comments (5)

You did already rule out that the table data actually appears twice in the PDF? - Realism

No... The PDF contains each table only once. The repetition of tables is in the data parsed by camelot-py. - Osteal

It is not ordinary Camelot behavior. Maybe your PDF is somewhat strange. Can you post it? - Virtu

I can't share the PDF because of the sensitive data it contains. But see my update above for more info about its structure. - Osteal

Facing the same issue. - Solita