How can I extract tables from an image within a PDF / scanned PDF?

Asked 24/11, 2022 at 11:15 Answered 24/11, 2022 at 11:15

python ocr tabular python-camelot

I need to extract a table from a scanned PDF. I tried Camelot and Tabula, but neither worked.

Any suggestions on how I can extract the tables?

Example

Neither Camelot nor Tabula detects the table.

Link to the PDF: https://drive.google.com/file/d/1atUmkNBkOGYFn43ZQreNqSg74XRhFP61/view?usp=sharing

Toastmaster answered 24/11, 2022 at 11:15 Comment(8)

What problem did you run into when you tried Camelot? Can you give us a hint? – Goodill 24/11, 2022 at 11:19

Neither Camelot nor Tabula recognizes the table @SezaiBurakKantarcı – Toastmaster 24/11, 2022 at 11:23

Without the original PDF, it is difficult to help you. I would add that if the PDF is image-based (you can't select or copy text), neither Camelot nor Tabula will work. – Unknown 24/11, 2022 at 11:33
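
One way to check whether a PDF is image-based before reaching for Camelot or Tabula is to see if any page has an extractable text layer. The following is a minimal sketch, not something from the thread: it assumes the third-party pypdf package, and the names `looks_image_based`, `read_page_texts`, and the `min_chars` cutoff are hypothetical choices for illustration.

```python
def looks_image_based(page_texts, min_chars=20):
    """Return True if no page yields at least `min_chars` of extractable text.

    `min_chars` is an arbitrary cutoff (an assumption) so that stray
    characters in an otherwise scanned page do not count as a text layer.
    """
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def read_page_texts(pdf_path):
    """Read the text layer of every page with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


# Usage (the file name is a placeholder):
#   if looks_image_based(read_page_texts("scan.pdf")):
#       print("No usable text layer: OCR is needed before table extraction.")
```

If this reports an image-based file, text-layer tools have nothing to parse, so an OCR or image-based table detection step has to come first.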

@StefanoFiorucci-anakin87, I have attached the original PDF. Is there another way to extract the table besides pytesseract? Any suggestions? – Toastmaster 24/11, 2022 at 11:43

Your issue is that this is a scanned drawing. If you have the originals, you should use those. If you only have the scanned image, you need to look into image-to-text libraries. – Ossify 24/11, 2022 at 12:1
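
As a rough illustration of the image-to-text route, here is a sketch that rasterizes the pages, runs OCR to get word positions, and groups the words into rows by vertical position. It assumes pdf2image (which needs Poppler) and pytesseract (which needs the Tesseract binary); the helper names and the `row_tolerance` value are hypothetical. It is a heuristic only: each word becomes its own cell, and no column detection is attempted.

```python
def group_words_into_rows(words, row_tolerance=10):
    """Group OCR words into rows by vertical position.

    `words` is a list of (text, left, top) tuples in pixels. A word joins the
    current row if its `top` is within `row_tolerance` pixels of the first
    word of that row; otherwise it starts a new row.
    """
    rows = []
    for text, left, top in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    # Order the words of each row from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def ocr_words(pdf_path, dpi=300):
    """Return one list of (text, left, top) tuples per page of a scanned PDF."""
    from pdf2image import convert_from_path  # third-party, requires Poppler
    import pytesseract  # third-party, requires the Tesseract binary

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
        pages.append([
            (text, left, top)
            for text, left, top in zip(data["text"], data["left"], data["top"])
            if text.strip()  # skip empty OCR entries
        ])
    return pages


# Usage (the file name is a placeholder):
#   for page_words in ocr_words("scan.pdf"):
#       for row in group_words_into_rows(page_words):
#           print(row)
```

On a scanned drawing the OCR quality and the tolerance will likely need tuning, and multi-word cells would have to be merged by horizontal gaps in a further step.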

I recommend trying the Table Transformer: huggingface.co/docs/transformers/model_doc/table-transformer – Checkrow 2/12, 2022 at 14:22
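
For anyone trying the Table Transformer suggestion, the sketch below follows the usage pattern in the linked documentation to locate tables on a page image. It assumes torch and transformers are installed and uses the `microsoft/table-transformer-detection` checkpoint; `detect_tables` and `pad_box` are hypothetical helper names, and the 0.9 threshold is just a starting point. Note that this model finds table regions only, so the text inside them would still need OCR.

```python
def detect_tables(image, threshold=0.9):
    """Return (label, score, [x0, y0, x1, y1]) for each detection.

    `image` is expected to be an RGB PIL image of one page.
    """
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    checkpoint = "microsoft/table-transformer-detection"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL gives (width, height); post-processing expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [
        (model.config.id2label[label.item()], score.item(), box.tolist())
        for score, label, box in zip(
            results["scores"], results["labels"], results["boxes"]
        )
    ]


def pad_box(box, padding, width, height):
    """Grow a box by `padding` pixels on each side, clipped to the image."""
    x0, y0, x1, y1 = box
    return (
        max(0, x0 - padding),
        max(0, y0 - padding),
        min(width, x1 + padding),
        min(height, y1 + padding),
    )


# Usage (the file name is a placeholder):
#   from PIL import Image
#   page = Image.open("page.png").convert("RGB")
#   for label, score, box in detect_tables(page):
#       crop = page.crop(pad_box(box, 10, *page.size))
```

Padding the detected box slightly before cropping is a precaution against clipping the outer border of the table; the crop can then be passed to an OCR step.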

See also: extract a table from a non-scanned PDF – Doro 11/2, 2023 at 10:31

See also: https://mcmap.net/q/336676/-how-to-extract-a-table-as-text-from-the-pdf – Doro 11/2, 2023 at 10:46
