How can I extract tables from an image within a PDF / scanned PDF?

I need to extract a table from a scanned PDF. I tried Camelot and Tabula, but neither worked.

Any suggestions on how I can extract the tables?

Example:

[image]

Neither Camelot nor Tabula detects the table:

[image]

Link to the PDF: https://drive.google.com/file/d/1atUmkNBkOGYFn43ZQreNqSg74XRhFP61/view?usp=sharing

Toastmaster answered 24/11, 2022 at 11:15

Comments (8):
What went wrong when you tried Camelot? Can you give us a hint? - Goodill
@SezaiBurakKantarcı Neither Camelot nor Tabula recognizes the table. - Toastmaster
Without the original PDF, it is difficult to help you. I would add that if the PDF is image-based (you can't select or copy text), neither Camelot nor Tabula will work. - Unknown
@StefanoFiorucci-anakin87, I have attached the original PDF. Is there another way to extract the table besides pytesseract? Any suggestions? - Toastmaster
Your issue is that this is a scanned drawing. If you have the originals, you should use those. If you only have the scanned image, you need to look into image-to-text (OCR) libraries. - Ossify
I recommend trying the Table Transformer: huggingface.co/docs/transformers/model_doc/table-transformer - Checkrow
See also: extract a table from a non-scanned PDF - Doro
See also: https://mcmap.net/q/336676/-how-to-extract-a-table-as-text-from-the-pdf - Doro
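
Since the comments point towards image-to-text libraries, here is a rough sketch of what that route could look like in Python. It rasterizes each page with pdf2image, runs Tesseract through pytesseract to get word bounding boxes, and then groups the words into rows and cells purely by position. Treat it as a starting point under assumptions, not a finished solution: it assumes the Tesseract and Poppler binaries are installed, the function names are made up for illustration, and the `dpi` and `gap` values are guesses that will likely need tuning for a scanned drawing like the one linked above.

```python
from statistics import median


def group_words_into_rows(words, y_tolerance=None):
    """Group OCR word boxes into rows by their vertical centre.

    `words` is a list of dicts with keys: text, left, top, width, height.
    """
    words = [w for w in words if w["text"].strip()]  # drop empty OCR entries
    if not words:
        return []
    if y_tolerance is None:
        # Assumption: words on the same row differ by less than ~60% of a
        # typical word height.
        y_tolerance = 0.6 * median(w["height"] for w in words)

    rows = []
    for w in sorted(words, key=lambda w: w["top"] + w["height"] / 2):
        cy = w["top"] + w["height"] / 2
        if rows and abs(cy - rows[-1]["cy"]) <= y_tolerance:
            row = rows[-1]
            row["words"].append(w)
            # Keep a running mean of the row's vertical centre.
            row["cy"] += (cy - row["cy"]) / len(row["words"])
        else:
            rows.append({"cy": cy, "words": [w]})
    # Within each row, order words left to right.
    return [sorted(r["words"], key=lambda w: w["left"]) for r in rows]


def split_row_into_cells(row, gap):
    """Merge neighbouring words into cells; a horizontal gap wider than
    `gap` pixels starts a new cell."""
    cells = []
    prev_right = None
    for w in row:
        if prev_right is not None and w["left"] - prev_right <= gap:
            cells[-1] += " " + w["text"]
        else:
            cells.append(w["text"])
        prev_right = w["left"] + w["width"]
    return cells


def ocr_words(image):
    """Run Tesseract on a PIL image and return word boxes as dicts."""
    import pytesseract  # imported lazily; needs the Tesseract binary
    from pytesseract import Output

    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    keys = ("text", "left", "top", "width", "height")
    return [dict(zip(keys, vals)) for vals in zip(*(data[k] for k in keys))]


def extract_rows_from_scanned_pdf(pdf_path, dpi=300, gap=40):
    """Return one list of rows (each a list of cell strings) per page.

    `dpi` and `gap` are illustrative defaults, not tuned values.
    """
    from pdf2image import convert_from_path  # needs Poppler installed

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        rows = group_words_into_rows(ocr_words(image))
        pages.append([split_row_into_cells(r, gap) for r in rows])
    return pages
```

Calling `extract_rows_from_scanned_pdf("scan.pdf")` (with a hypothetical file name) would give a nested list that can be passed to `pandas.DataFrame` page by page. Position-based grouping tends to break down on skewed scans or tables with merged cells, so cropping to the table region first, for example with the Table Transformer mentioned in the comments, may give cleaner results.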
