I want to extract text under specific headings from a pdf using python.
For example, I have a pdf with headings Introduction,Summary,Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?
I want to extract text under specific headings from a pdf using python.
For example, I have a pdf with headings Introduction,Summary,Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?
This scenario is exactly what I am working on in my current company. We need to extract text lying under a heading. I'm personally using a rule based system i.e, using regex to identify all the numbered headings after reading the entire document line by line. Once I have the headings I enter the name of the heading for which I want to find the corresponding paragraph. This input is matched with the pre-existing list of headings and using universal sentence encoder I find the nearest match. After that I just display all the contents that is present from that heading upto the immediate next heading.
Pdf is unstructured text so there are no tags to extract data directly. So we use regular expression to find desired information from a corpus of text. Extract raw page text using following code.
import fitz
page = pdf_file.loadPage(0) # 0 represents the page number... upto n-1 pages...
dl = page.getDisplayList()
tp = dl.getTextPage()
tp_text=tp.extractText()
re.split('\n\d+.+[ \t][a-zA-Z].+\n',tp_text)
Then apply regular expression as per your need... ( this re worked for me but you may or may not need to change it)
I am giving a detailed example how this will work
re.findall('\n\d+.+[ \t][a-zA-Z].+\n',"some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparapgraph 2")
Output : ['\n1. heading 1\n', '\n1.2.3 Heading 2\n']
You can use re.split
to split text per headings and retrieve you desired heading text.
re.split('\n\d+.+[ \t][a-zA-Z].+\n',"some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparapgraph 2")
Output: ['some text', 'paragraph 1', 'parapgraph 2']
Simply ith heading will have (i+1) heading text.
The best method i found using regular expression
regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"
print(re.findall(regex,samplestring, re.M))
© 2022 - 2024 — McMap. All rights reserved.