Pdfplumber cannot recognise table python [duplicate]

I use pdfplumber to extract the table on page 2 (normally section 3 of the document). It works for some PDFs but not for others. For the PDFs that fail, pdfplumber seems to read the bottom table instead of the table I want.

How can I get the table? Link to the PDF that doesn't work: pdfA

Link to the PDF that works: pdfB

Here is my code:

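import pandas as pd
import pdfplumber

# Open the PDF and take page 2 (pages are zero-indexed)
with pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf") as pdf:
    page = pdf.pages[1]
    table = page.extract_table()  # default table settings

# First row as the header, remaining rows as data
df = pd.DataFrame(table[1:], columns=table[0])
df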

and the result is: [image]

But the table I want on page 2 is: [image]

However, this code works for pdfB (which I mentioned above).

Btw, the table I want in each pdf is in section 3.

Can anyone help?

Many thanks, Joan

Update: I just found a package that extracts the PDF content without any problems. The package is PyMuPDF, which is imported as fitz.

Ovule answered 20/7, 2020 at 17:1 Comment(2)
Hi Joan, thank you for the question. I am working on a similar PDF; would it be possible to connect? - Countable
@AjayPyatha what do you mean? - Ovule

Here is a solution to your problem, but first please read the points below:

  • You used pdfplumber for table extraction, but you should read about its table settings. There are many of them, and once you match them to your needs you will find your answer there. See the table-extraction section of the pdfplumber API documentation.
  • I give a solution for your problem below, but do check the pdfplumber documentation properly first. It covers table extraction as well as text extraction, word extraction, and so on, so you should not need to ask about table extraction again.
  • To understand the table settings better, you can also use visual debugging. It is one of pdfplumber's best features: it shows exactly what the table settings do and how the table is extracted with them. See the section on visual debugging of tables in the documentation.
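
The visual debugging mentioned in the last point can be sketched roughly as follows. This is a minimal sketch, not part of the original answer: `settings_variants` and `save_debug_images` are hypothetical helper names, the output file names are made up, and the candidate values are only examples. It uses pdfplumber's `Page.to_image()` and `PageImage.debug_tablefinder()` to draw what the table finder detects for each settings variant, so that the tolerances can be picked by looking at the images.

```python
# Base settings, matching the ones used in this answer; tune them per document.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
}


def settings_variants(base, key, values):
    """Return one copy of `base` per candidate value of `key`."""
    return [{**base, key: value} for value in values]


def save_debug_images(pdf_path, page_index, variants, prefix="debug"):
    """Draw the detected table lines and cells for each settings variant.

    Hypothetical helper: writes one PNG per variant and returns the paths.
    """
    import pdfplumber  # imported here so the helpers above work without it

    paths = []
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        for i, settings in enumerate(variants):
            image = page.to_image(resolution=150)
            image.debug_tablefinder(settings)  # overlay lines, intersections, cells
            out_path = f"{prefix}_{i}.png"
            image.save(out_path)
            paths.append(out_path)
    return paths
```

For example, `save_debug_images("GSAP_msds_01259319.pdf", 1, settings_variants(TABLE_SETTINGS, "snap_tolerance", [2, 4, 6]))` would write three images to compare side by side.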

Here is the solution to your problem:

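import pandas as pd
import pdfplumber

with pdfplumber.open("GSAP_msds_01259319.pdf") as pdf:
    p1 = pdf.pages[1]  # page 2 (pages are zero-indexed)
    # Find columns from ruled lines and rows from text alignment;
    # snap_tolerance snaps nearly aligned parallel lines to one position
    table = p1.extract_table(table_settings={"vertical_strategy": "lines",
                                             "horizontal_strategy": "text",
                                             "snap_tolerance": 4})
df = pd.DataFrame(table[1:], columns=table[0])
df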

See the output of the above code.
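
One possible reason the default call returns a different table for some files is that `extract_table()` returns only the largest table found on the page. The sketch below picks a table by position instead. It assumes the wanted table is the topmost one on the page, which may not hold for every PDF; `pick_topmost` and `extract_topmost_table` are hypothetical helper names, not pdfplumber functions.

```python
def pick_topmost(bboxes):
    """Return the index of the bounding box closest to the top of the page.

    Each bbox is (x0, top, x1, bottom), the layout pdfplumber uses for
    Table.bbox, where `top` is measured downward from the top of the page.
    """
    if not bboxes:
        raise ValueError("no tables found")
    return min(range(len(bboxes)), key=lambda i: bboxes[i][1])


def extract_topmost_table(pdf_path, page_index, table_settings=None):
    """Extract the rows of the topmost table on one page (assumption: that
    is the table you want)."""
    import pdfplumber  # imported here so pick_topmost works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        tables = page.find_tables(table_settings=table_settings or {})
        index = pick_topmost([t.bbox for t in tables])
        return tables[index].extract()
```

If the wanted table is not the topmost one, the same idea works with a different rule, for example choosing the table whose `top` lies just below the "section 3" heading.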

Lisp answered 28/7, 2020 at 12:13 Comment(1)
Hi Faizan, how did you know to use intersection_y_tolerance: 1 and snap_tolerance: 4? Many thanks - Ovule

To extract two tables from the same page, I use this code:

import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    # find_tables() returns every table detected on the page
    tables = pdf.pages[0].find_tables()
    t1_content = tables[0].extract(x_tolerance=5)
    t2_content = tables[1].extract(x_tolerance=5)
    print(t1_content, '\n', t2_content)
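
Each extracted table is a list of rows, and cells can come back as None. As a rough sketch of how the rows could be tidied before further use, the hypothetical helper `table_to_records` below (not part of pdfplumber) treats the first row as the header and replaces empty cells with empty strings.

```python
def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts.

    The first row is treated as the header. Empty cells (None) become "".
    Returns [] when there is no table (None or an empty list).
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records
```

With the code above, this would be called as `table_to_records(t1_content)` and `table_to_records(t2_content)`.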
Fefeal answered 11/9, 2021 at 13:12 Comment(0)
