Parsing Index page in a PDF text book with Python

I need to extract text from PDF pages into a CSV file, preserving the indentation exactly as it appears on the page.

Index page from the PDF textbook:

I need to split the text into a class/subclass hierarchy along with the page numbers. For example, in the image, "Application server" is the class and "Apache Tomcat" is a subclass, listed on page 275.

This is the expected output of the CSV:

I used the Tika parser to parse the PDF, but the indentation is not preserved consistently in the parsed content, so I cannot rely on it to split the text into classes and subclasses.

This is what the parsed text looks like:

Can anyone suggest the right approach for this requirement?

Cammie answered 3/3, 2018 at 18:35 Comment(0)

I have no experience with PDF extraction, but the hierarchy can be reconstructed from the parsed text, because each group of subclasses is preceded and followed by an extra newline character.

With the following test text:

app architect . 50
app logic . 357
app server . 275

tomcat . 275
websphere . 275
jboss . 164

architect

acceptance . 303
development path . 304

architecting . 48
architectural activity . 25, 320

the following code:

import csv
import sys
import re


def gen():
    is_subclass = False
    p_class = None

    with open('test.data') as f:
        s = f.read()
    lines = re.findall(r'[^\n]+\n+', s)
    for line in lines:
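        # Split on the last ' . ' only, so a dot inside the name does not
        # break the two-value unpacking.
        if ' . ' in line:
            class_name, page_no = (part.strip() for part in line.rsplit(' . ', 1))
        else:
            class_name, page_no = line.strip(), ''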

        if line.endswith('\n\n'):
            if not is_subclass:
                p_class = class_name
                is_subclass = True
                continue

        if is_subclass:
            yield (p_class, class_name, page_no)
        else:
            yield (class_name, '', page_no)

        if line.endswith('\n\n'):
            is_subclass = False


writer = csv.writer(sys.stdout)
writer.writerows(gen())

yields:

app architect,,50
app logic,,357
app server,tomcat,275
app server,websphere,275
app server,jboss,164
architect,acceptance,303
architect,development path,304
architecting,,48
architectural activity,,"25, 320"

Hope this helps.
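The code above splits each entry on a single ' . ' and keeps the page field as one string. If the real extracted text has longer dot leaders (for example "tomcat . . . . 275", which is an assumption, since the test text only has single dots), a regular expression may be more forgiving. The following is a minimal sketch of that idea; `ENTRY_RE` and `parse_entry` are hypothetical names, not part of the answer above.

```python
import re

# Hypothetical helper: name, then one or more dots (with optional spaces),
# then a comma-separated list of page numbers at the end of the line.
ENTRY_RE = re.compile(
    r'^(?P<name>.*?)\s*(?:\.\s*)+(?P<pages>\d+(?:\s*,\s*\d+)*)\s*$'
)


def parse_entry(line):
    """Return (name, pages); pages is an empty list when no page field is found."""
    text = line.strip()
    match = ENTRY_RE.match(text)
    if match is None:
        # Heading-only entries such as "architect" carry no page number.
        return text, []
    pages = [int(p) for p in match.group('pages').split(',')]
    return match.group('name'), pages
```

Returning the pages as a list of integers makes it possible to write one CSV row per page, or to join them back into a single field, depending on what the output format needs.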

Scauper answered 25/9, 2018 at 12:27 Comment(3)
Hi @Scauper, thanks for your solution, but I am looking for a more end-to-end approach, as the PDF format keeps changing. - Handiwork
@BhushanPant Haha, I know. I did it for fun. This is workable if you just want to get things done. - Scauper
The problem was the PDF extraction; only a few people know how to accurately extract information from a PDF file and restructure it properly. Pandas was a great way to store all spans with their bounding boxes (bbox); otherwise, in that scenario, the row was clean and structured. - Generative

So here is the solution:

  1. Install Fitz(PyMuPDF) https://github.com/rk700/PyMuPDF
  2. Run the code below with Python 2.7, in the same folder as your PDF file
  3. Compare the result

Code:

import fitz
import json
import re
import csv

class MyClass:
    def __init__(self, text, main_class):
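        # "[.]+" splits on runs of dots; "[.]*" can match the empty string,
        # which makes re.split cut between every character on Python 3.7+.
        my_arr = re.split("[.]+", text)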
        if main_class != my_arr[0].strip():
            main_class = my_arr[0].strip()
        self.main_class = main_class
        self.sub_class = my_arr[0].strip()
        try:
            self.page = my_arr[1].strip()
        except IndexError:
            # No dots in the text, so there is no page field.
            self.page = ""

def add_line(text, is_recording, main_class):
    if(is_recording):
        obj = MyClass(text, main_class)
        if obj.sub_class == "Glossary":
            return False, main_class
        table.append(obj)
        return True, obj.main_class
    elif text == "Contents":
        return True, main_class
    return False, main_class

last_text = ""
is_recording = False
main_class = ""
table = []

doc = fitz.open("TCS_1.pdf")
page = doc.getPageText(2, output="json")
blocks = json.loads(page)["blocks"]
for block in blocks:
    if "lines" in block:
        for line in block["lines"]:
            line_text = ""
            for span in block["lines"]:
                line_text += span["spans"][0]["text"].encode("utf-8")
            if last_text != line_text:
                is_recording, main_class = add_line(line_text, is_recording, main_class)
                last_text = line_text

writer = csv.writer(open("output.csv", 'w'), delimiter=',', lineterminator='\n')
for my_class in table:
    writer.writerow([my_class.main_class, my_class.sub_class, my_class.page])
    # print(my_class.main_class, my_class.sub_class, my_class.page)

Here is the CSV output for the file provided:
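Since the question is about indentation, the left edge of each line's bounding box can stand in for it. The sketch below is not the code above: it assumes the page has already been extracted as a dictionary shaped like the output of `page.get_text("dict")` in recent PyMuPDF versions (blocks containing lines, each line carrying a `bbox` and a list of `spans`), and that the index is laid out in a single column. `classify_by_indent` and `tolerance` are hypothetical names.

```python
import bisect


def classify_by_indent(page_dict, tolerance=2.0):
    """Return [(level, text), ...]; level 0 is the leftmost indent."""
    lines = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                lines.append((line["bbox"][0], text))

    # Group left edges that differ by at most `tolerance` into one level.
    levels = []
    for x0 in sorted({x for x, _ in lines}):
        if not levels or x0 - levels[-1] > tolerance:
            levels.append(x0)

    # A line's level is the last level edge that is not to its right.
    return [(bisect.bisect_right(levels, x0) - 1, text) for x0, text in lines]
```

Lines at level 0 would then be treated as classes and lines at level 1 as subclasses of the most recent level 0 line. A multi-column index page would need the columns separated first, which this sketch does not attempt.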

Generative answered 1/10, 2018 at 15:36 Comment(3)
Hi Jonathan, thanks for your answer. Can you share your email with me? - Handiwork
I am not sure how to extend the bounty time. Can you help me with that? - Handiwork
Actually, I'm in time :) 25 minutes before the countdown ends. I hope you will enjoy this solution. - Generative
