How to scrape PDFs using Python; specific content only
I am trying to get data from PDFs available on the site

https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en

For example, if I look at the November 2019 report:

https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf

I need the data on page 12 for corn, and I have to create separate files for ending stocks, exports, etc. I am new to Python and I am not sure how to scrape the content separately. If I can figure it out for one month, then I can create a loop, but I am confused about how to proceed for a single file.

Can someone help me out here? TIA.

Horizon answered 1/12, 2019 at 22:43 Comment(4)
If the page serves everything in one PDF, then you will have to download that file and later use other modules to get data out of the PDF. But those modules have nothing to do with 'scraping'; they are usually described by the words edit or extract. Mulford
I checked this page and I see links to txt, xls, and xml files; it would be easier to get the txt file and work with the text, or possibly with the xml or xls. Mulford
Actually, they do not have text files for all the years; that's why I was thinking of extracting from the PDFs. Horizon
Using requests or urllib you can get the HTML from the server; using BeautifulSoup you can find the links to PDFs in the HTML; using those links with requests or urllib you can download the PDFs. Afterwards you will have to use other tools to work with the PDFs. There are modules like PDFMiner and PyPDF2 for working with PDFs in Python, but I don't have experience with them. Mulford
Here is a little example using PyPDF2, requests, and BeautifulSoup. Please check the comments; this handles the first block of links, and if you need more you just need to change the value of the url variable.

# You need to install:
# pip install PyPDF2          -> read and parse the PDF content
# pip install requests        -> fetch the HTML page and the PDFs
# pip install beautifulsoup4  -> parse the HTML and find all hrefs ending in ".pdf"
# pip install lxml            -> parser backend used by BeautifulSoup below
from PyPDF2 import PdfFileReader
import requests
import io
from bs4 import BeautifulSoup

url = requests.get('https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items')
soup = BeautifulSoup(url.content, "lxml")

for a in soup.find_all('a', href=True):
    urlpdf = a['href']
    if urlpdf.endswith('.pdf'):
        print("url ending in .pdf:", urlpdf)
        response = requests.get(urlpdf)
        with io.BytesIO(response.content) as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()
            txt = f"""
            Author: {information.author}
            Creator: {information.creator}
            Producer: {information.producer}
            Subject: {information.subject}
            Title: {information.title}
            Number of pages: {number_of_pages}
            """
            # The metadata of the PDF
            print(txt)
            # Page numbers are zero-indexed, so 20 is the 21st page
            numpage = 20
            page = pdf.getPage(numpage)
            page_content = page.extractText()
            # Print the content of that page
            print(page_content)
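Since the question asks for the extracted text to end up in separate files per report, here is a minimal sketch of one way to do that, using a hypothetical helper `save_page_text` (not part of PyPDF2) that you would call inside the loop above with `urlpdf`, `page_content`, and the page number:

```python
# Hypothetical helper: derive a file name from the PDF's URL so each
# report/page pair lands in its own text file,
# e.g. ".../mg74r196p/latest.pdf" -> "latest_page12.txt"
def save_page_text(url, page_text, page_number):
    stem = url.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    name = f"{stem}_page{page_number}.txt"
    with open(name, "w", encoding="utf-8") as out:
        out.write(page_text)
    return name
```

From there you could split the page text further (ending stocks, exports, etc.) before writing, once you know how the sections are labeled in the report.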
Goulder answered 2/12, 2019 at 0:3 Comment(0)
I would recommend Beautiful Soup if you need to scrape data from a website, but if the PDF pages are scanned images you are going to need OCR to extract the data. There is a library called pytesseract; look into it and its tutorials and you should be set.

Sisk answered 1/12, 2019 at 23:9 Comment(0)
Try pdfreader. You can extract a page's content as PDF "markdown" containing the decoded text strings and then parse it as plain text.


from pdfreader import SimplePDFViewer

# Open the downloaded report and render page 12
fd = open("latest.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.navigate(12)
viewer.render()
markdown = viewer.canvas.text_content

The markdown variable contains all the texts, including PDF commands (positioning, display): every string comes in brackets followed by a Tj or TJ operator. For more on PDF text operators, see PDF 1.7, sec. 9.4 "Text Objects".

You can parse it with regular expressions for example.
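For instance, a small sketch of the regex approach, run on a made-up fragment of such markdown (the real content from `viewer.canvas.text_content` will be much longer, but the Tj/TJ shape is the same):

```python
import re

# Hypothetical sample of text-content markdown: strings shown with the
# Tj operator come in parentheses; TJ takes an array of strings and
# kerning numbers in square brackets.
markdown = "BT (Corn) Tj ET BT [(Ending) -250 (Stocks)] TJ ET"

# Capture the payload of every Tj and TJ operator.
pattern = r"\((.*?)\)\s*Tj|\[(.*?)\]\s*TJ"

texts = []
for tj, tj_array in re.findall(pattern, markdown):
    if tj:
        texts.append(tj)
    else:
        # For TJ arrays, drop the kerning numbers and keep the strings.
        texts.append("".join(re.findall(r"\((.*?)\)", tj_array)))

print(texts)  # ['Corn', 'EndingStocks']
```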

Haycock answered 2/12, 2019 at 15:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.