PDF miner - extract font size? [duplicate]

Asked 11/3, 2014 at 14:59 Answered 30/11, 2019 at 12:53

Is it possible to use pdfminer to extract font size? I think this would be helpful for separating a document into sections. I know about the discussion linked below, but I'd like to know whether pdfminer specifically can do it:

Extract text from PDF in respect to formatting (font size, type etc)

The pdfminer documentation says it's possible: http://www.unixuser.org/~euske/python/pdfminer/

But when I type the following into the command line, I just get a plain text document. I don't see any font information.

pdf2txt.py -o output.html samples/CentolaCV.pdf

For example, the output looks like this:

2008-13  Assistant Professor, Sloan School of Management, M.I.T.  

2006-08   Robert Wood Johnson Scholar in Health Policy, Harvard University 

2001-02   Visiting Scholar, The Brookings Institution

Sitting answered 11/3, 2014 at 14:59 Comment(0)

Try specifying the file output type with the -t flag:

pdf2txt.py -o output.html -t html samples/CentolaCV.pdf

That should return an HTML file whose span elements carry the style attributes font-family and font-size.

EDIT: actually, it looks like the extension of the output file can determine the output type without the -t flag. Can you link to the PDF file that you're trying to extract font style from?
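
To make the font-size part concrete, here is a minimal sketch that reads font sizes back out of such an HTML file using only the standard library. The sample markup and the FontSizeParser class are illustrative assumptions, not actual pdf2txt.py output; the real markup may differ in detail.

```python
from html.parser import HTMLParser
import re

# Illustrative markup only: assumed to resemble the styled spans in the HTML output.
SAMPLE_HTML = (
    '<span style="font-family: Times-Bold; font-size:14px">Education</span>'
    '<span style="font-family: Times-Roman; font-size:10px">2001-02 Visiting Scholar</span>'
)

FONT_SIZE_RE = re.compile(r'font-size:\s*(\d+(?:\.\d+)?)px')


class FontSizeParser(HTMLParser):
    """Collect (text, font size in px) for text inside styled <span> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._size_stack = []  # font size of each open <span>, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            style = dict(attrs).get('style') or ''
            match = FONT_SIZE_RE.search(style)
            self._size_stack.append(float(match.group(1)) if match else None)

    def handle_endtag(self, tag):
        if tag == 'span' and self._size_stack:
            self._size_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when the innermost open span declares a font size.
        if text and self._size_stack and self._size_stack[-1] is not None:
            self.rows.append((text, self._size_stack[-1]))


parser = FontSizeParser()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
print(rows)
```

To run it on a real file, feed the contents of output.html to the parser instead of SAMPLE_HTML.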

Universal answered 6/6, 2014 at 19:55 Comment(1)

Is it possible to get font-weight too? I need the text in bold. – Bedford 3/2, 2018 at 11:37
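
On the font-weight question: the answer above mentions only font-family and font-size, so one possible workaround is to guess the weight from the font name itself. The sketch below is a heuristic under the assumption that the font's name mentions its weight (for example a name ending in "-Bold"); looks_bold is a hypothetical helper, not part of pdfminer, and it will miss bold fonts whose names say nothing about weight.

```python
def looks_bold(font_name):
    """Guess boldness from a font name such as 'ABCDEF+Arial-BoldMT'.

    Heuristic only: it relies on the font name mentioning its weight.
    """
    # Drop the subset prefix ('ABCDEF+') that embedded fonts often carry.
    base_name = font_name.split('+')[-1].lower()
    return any(marker in base_name for marker in ('bold', 'black', 'heavy'))


print(looks_bold('ABCDEF+Arial-BoldMT'))
print(looks_bold('Times-Roman'))
```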

This task puzzled me for a long time. Besides extracting the font information, I also wanted to run everything from a Python script.

However, today I was able to solve it. The script below calls pdf2txt.py from the command line to convert the PDF to HTML, and then extracts the font information from the newly created HTML file.

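import os
import subprocess

# placeholder paths: replace them with the real locations
pathToScript = r'path\to\script\pdf2txt.py'
pathPDFinput = os.path.join(r'path\to\file', 'test.pdf')
pathHTMLoutput = os.path.join(r'path\to\file', 'test.html')

# call pdf2txt.py from the command line
# (passing a list of arguments keeps paths with spaces intact)
subprocess.run(['python', pathToScript, '-o', pathHTMLoutput, '-S', pathPDFinput, '-t', 'html'], check=True)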

Extract the font-size for every html tag:

# credits to akash karothiya: 
# https://mcmap.net/q/324652/-need-to-extract-all-the-font-sizes-and-the-text-using-beautifulsoup/39015419#39015419

import re
import pandas as pd
from bs4 import BeautifulSoup

# open the html file and parse it; naming the parser avoids a bs4 warning
# (utf-8 is assumed here as the encoding of the html file)
with open(pathHTMLoutput, 'r', encoding='utf-8') as html:
    soup = BeautifulSoup(html, 'html.parser')

font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for span in font_spans:
    style = str(span.get('style'))
    size_match = re.search(r'(?is)(font-size:)(.*?)(px)', style)
    family_match = re.search(r'(?is)(font-family:)(.*?)(;)', style)

    # skip spans whose own style lacks either attribute
    # (e.g. only a nested span carries the font information)
    if size_match is None or family_match is None:
        continue
    fonts_size = size_match.group(2)
    fonts_family = family_match.group(2)

    # split fonts_family into fonts-type and fonts-style
    try:
        fonts_type = fonts_family.strip().split(',')[0]
        fonts_style = fonts_family.strip().split(',')[1]
    except IndexError:
        fonts_type = fonts_family.strip()
        fonts_style = None

    output.append(
        (str(span.text).strip(), fonts_size.strip(), fonts_type, fonts_style)
    )

# create dataframe
df = pd.DataFrame(output, columns=['text', 'fonts-size', 'fonts-type', 'fonts-style'])
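
Since the original goal was to separate sections, here is a rough sketch of how the extracted sizes could be used for that. It assumes that body text uses the most common font size and that anything larger is a heading. The rows below are hypothetical sample data, and split_sections is an illustrative helper; with the script above you would first convert the fonts-size strings to float.

```python
from collections import Counter

# Hypothetical (text, font size) rows; real rows would come from the HTML file.
rows = [
    ('Section A', 14.0),
    ('2008-13 Assistant Professor', 10.0),
    ('2006-08 Robert Wood Johnson Scholar', 10.0),
    ('Section B', 14.0),
    ('2001-02 Visiting Scholar', 10.0),
]


def split_sections(rows):
    """Group rows under headings.

    A row counts as a heading when its font size is larger than the most
    common size, which is assumed to be the body text size.
    """
    body_size = Counter(size for _, size in rows).most_common(1)[0][0]
    sections = []
    for text, size in rows:
        if size > body_size:
            sections.append((text, []))      # start a new section
        else:
            if not sections:                 # body text before any heading
                sections.append((None, []))
            sections[-1][1].append(text)
    return sections


sections = split_sections(rows)
for heading, lines in sections:
    print(heading, lines)
```

This treats every larger size as a heading of the same level; documents with several heading sizes would need a finer rule.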

Manrope answered 30/11, 2019 at 12:53 Comment(0)
