How to extract text under specific headings from a pdf?
A

3

14

I want to extract the text under specific headings from a PDF using Python.

For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'.

How can I do this?

sample-image

Albanese answered 5/1, 2018 at 5:19 Comment(6)
As I'm not into PDF processing with Python, I cannot give an answer. Knowing a bit about PDFs, though, let me hint at some difficulty: your example file has two text columns, but this is not necessarily reflected in the internal PDF contents. Depending on the document itself, there most probably is a solution for the task; to present a matching solution, though, the PDF in question had better be provided. Otherwise people may present solutions that work for similar documents but not yours, or not present a solution at all because they cannot test whether it matches. - Insult
@Midhun Opening a bounty may be a nice idea, but even then cooperation by the OP is required, and Alfiya has not replied to usr2564301's comment under Ankit's answer about whether the tip in that comment solved the issue. - Insult
Were you able to get the solution? - Kibbutz
@user2999110 Hey, as far as I could understand, regex is the only solution. But I couldn't find any solution for a PDF with unpredictable heading formats. In such cases, regex won't work. - Albanese
@Albanese Have you found the solution for this? - Dowd
@Dowd Hey, I couldn't solve this for unstructured PDFs. - Albanese
M
9

This scenario is exactly what I am working on at my current company: we need to extract the text lying under a heading. I'm personally using a rule-based system, i.e., I read the entire document line by line and use regex to identify all the numbered headings. Once I have the headings, I enter the name of the heading whose paragraph I want to find. This input is matched against the list of detected headings, and I use the Universal Sentence Encoder to find the nearest match. After that, I simply display all the content from that heading up to the next heading.
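The steps described above could be sketched roughly as follows. This is not the answerer's code, which was never posted: the heading pattern and the names `HEADING_RE`, `split_by_headings`, and `find_section` are assumptions made for illustration, and the standard-library `difflib` is swapped in for the Universal Sentence Encoder so that the example runs without extra dependencies.

```python
import re
from difflib import get_close_matches

# Assumed heading shape: "1 Title", "1. Title", "1.2.3 Title" (hypothetical rule).
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S.*$")


def split_by_headings(lines):
    """Group lines under the most recent numbered heading."""
    sections = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            current = stripped
            sections[current] = []
        elif current is not None:
            sections[current].append(stripped)
        # Lines before the first heading are ignored.
    return {h: "\n".join(body).strip() for h, body in sections.items()}


def find_section(sections, query):
    """Return the text under the heading whose title is closest to `query`.

    difflib's string similarity stands in for an embedding-based match.
    """
    # Map lowercased titles (numbering stripped) back to the full heading.
    titles = {re.sub(r"^[\d.]+\s+", "", h).lower(): h for h in sections}
    match = get_close_matches(query.lower(), list(titles), n=1, cutoff=0.6)
    return sections[titles[match[0]]] if match else None


lines = [
    "Report",
    "1. Introduction",
    "Intro text.",
    "2. Summary",
    "Key findings.",
    "More findings.",
    "3. Contents",
    "Listing.",
]
sections = split_by_headings(lines)
print(find_section(sections, "sumary"))  # a misspelled query still finds "2. Summary"
```

One limitation of this rule: any body line that happens to start with a number followed by a space (for example "2019 was a busy year") would be treated as a heading, so the pattern would need tightening for real documents.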

Mercurio answered 10/7, 2019 at 9:14 Comment(2)
Could you please add code snippets and explain the steps for better understanding? - Platy
@Mercurio Can you please share the code for this? - Dowd
B
3

A PDF is unstructured text, so there are no tags from which to extract data directly. Instead, we use regular expressions to find the desired information in the raw text. Extract the raw page text using the following code.

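import re
import fitz  # PyMuPDF

pdf_file = fitz.open("document.pdf")  # placeholder path; use your own file
page = pdf_file.load_page(0)  # 0 is the page number; valid values are 0..n-1
# get_text() replaces the older getDisplayList/getTextPage/extractText chain
tp_text = page.get_text()
# Split the page text at numbered headings such as "1. Title" or "1.2.3 Title"
sections = re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', tp_text)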

Then apply a regular expression suited to your document (this one worked for me, but you may need to change it).

Here is a detailed example of how this works:

re.findall(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")

Output: ['\n1. heading 1\n', '\n1.2.3 Heading 2\n']

You can use re.split to split the text at each heading and retrieve the text under the heading you want.

re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")

Output: ['some text', 'paragraph 1', 'paragraph 2']

In other words, the text under the i-th heading returned by re.findall (counting from 0) is element i+1 of the re.split result; element 0 is whatever precedes the first heading.
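The index relationship between re.findall and re.split can be made explicit by pairing the two results. The following is a minimal sketch using the same pattern; the helper name `sections_by_heading` is made up for illustration.

```python
import re

# Same heading pattern as in the answer, written as a raw string.
HEADING = r"\n\d+.+[ \t][a-zA-Z].+\n"


def sections_by_heading(text):
    """Map each heading to the text that follows it."""
    headings = [h.strip() for h in re.findall(HEADING, text)]
    # Drop element 0 of the split: it is the text before the first heading.
    bodies = re.split(HEADING, text)[1:]
    return dict(zip(headings, bodies))


sample = "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2"
result = sections_by_heading(sample)
print(result)
# {'1. heading 1': 'paragraph 1', '1.2.3 Heading 2': 'paragraph 2'}
```

Because the pattern consumes the newline after each heading, two headings on consecutive lines with no text between them would not both match; that case would need a lookahead instead of the trailing `\n`.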

Bootee answered 10/2, 2020 at 22:7 Comment(0)
V
3

The best method I found uses a regular expression:

import re
# samplestring is the text already extracted from the PDF
regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"
print(re.findall(regex, samplestring, re.M))
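To show what this pattern returns, here is a small sketch that runs it on made-up text and picks out one section by title. The sample text and the helper `section_body` are assumptions for illustration. Note that the pattern expects the number to be followed directly by a space ("2 Summary" or "1.2 Scope"), so a heading written as "2. Summary" would not match.

```python
import re

# The pattern from the answer: a numbered heading line plus every following
# line that does not itself start with a heading number.
SECTION_RE = re.compile(r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M)


def section_body(text, title):
    """Return the text under the heading named `title`, or None if absent."""
    for block in SECTION_RE.findall(text):
        heading, _, body = block.partition("\n")
        _number, _, name = heading.partition(" ")
        if name.strip().lower() == title.lower():
            return body.strip()
    return None


sample = (
    "1 Introduction\n"
    "Intro text.\n"
    "2 Summary\n"
    "Summary line one.\n"
    "Summary line two.\n"
    "3 Contents\n"
    "Contents text."
)
summary = section_body(sample, "Summary")
print(summary)
```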

Verbena answered 22/7, 2020 at 6:14 Comment(0)
