Download pdf using urllib?

I am trying to download a PDF file from a website using urllib. This is what I've got so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty PDF file. What am I doing wrong?

Ellisellison asked 19/7, 2014 at 20:33 Comment(2)
Probably, you should first check the HTTP response code (getcode()). This might provide some clue. If all is OK at the HTTP level, we have to look elsewhere. Have you tried to download a PDF from another source? Could you provide the real URL of the PDF for testing purposes? – Instillation
To copy to a local file, use urlretrieve – Slipway
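A minimal sketch of the first comment's suggestion (Python 3; the function name and status check are mine, and note that urlopen already raises HTTPError for 4xx/5xx responses, so this is extra insurance):

```python
from urllib.request import urlopen

def download_checked(download_url, filename='some_file.pdf'):
    response = urlopen(download_url)
    # getcode() returns the HTTP status for http(s) URLs;
    # it can be None for other URL schemes, treated as success here
    status = response.getcode()
    if status is not None and status != 200:
        raise RuntimeError('unexpected HTTP status: %s' % status)
    with open(filename, 'wb') as local_file:  # 'wb': PDF data is binary
        local_file.write(response.read())
```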

Here is an example that works:

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    # 'wb' is essential: PDF data is binary
    with open("document.pdf", "wb") as local_file:
        local_file.write(response.read())
    print("Completed")

if __name__ == "__main__":
    main()
Mayan answered 19/7, 2014 at 21:57 Comment(3)
As noted by shockburner, you need to use open("document.pdf", 'wb') – Lesh
This can work in Python 3 as well. All you have to do is change urllib2 to urllib.requests in both locations. – Ophthalmoscopy
*urllib.request (no s) – Ophthalmoscopy
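As the comments note, porting this answer to Python 3 only requires switching to urllib.request; a sketch of that port (the data still has to be written in 'wb' mode):

```python
from urllib.request import urlopen  # urllib2's urlopen lives here in Python 3

def download_file(download_url, filename="document.pdf"):
    response = urlopen(download_url)
    with open(filename, "wb") as local_file:  # 'wb' is still required
        local_file.write(response.read())
    print("Completed")

# usage: download_file("http://mensenhandel.nl/files/pdftest2.pdf")
```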

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'). PDF files are binary, so you need the 'b' flag. This is true of pretty much any file that you can't open in a text editor.
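A quick way to see why the flag matters in Python 3: text mode refuses bytes outright, so downloaded data can never be written correctly without 'b' (file names here are just examples; in Python 2 on Windows, 'w' instead silently translates newline bytes and corrupts the PDF):

```python
data = b'%PDF-1.4 ...'  # a PDF starts with raw bytes like these

with open('demo.pdf', 'wb') as f:  # binary mode accepts bytes
    f.write(data)

try:
    with open('demo.txt', 'w') as f:  # text mode expects str, not bytes
        f.write(data)
except TypeError as e:
    print('text mode raised:', e)
```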

Motheaten answered 19/7, 2014 at 21:15 Comment(4)
I tried it, but it still doesn't work. When I try to open the PDF file, I get an error message saying that the file type is not supported or that the file is damaged. – Ellisellison
Odd, it works for me. Even your original code works for me without the 'b'. Can you download the PDF in a browser and open it normally? If you can, then you should also include your Python and urllib versions with print urllib.__version__. You might also want to try urllib2 instead of urllib. – Motheaten
@Ellisellison Take a look at the file with a web browser. It's most likely an HTML page with a captcha from the file-hosting site you are trying to download from. – Telencephalon
When I clicked the URL I found on a website, it started downloading directly without redirecting me to another page. When I found the link in the HTML and pasted it into a browser, I had to complete a CAPTCHA to download it. – Ellisellison

Try using urllib.request.urlretrieve (Python 3) and just do this:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')
Lombardi answered 5/2, 2018 at 5:52 Comment(2)
from urllib.request import urlretrieve, request has a typo :) – Showmanship
Thanks, it helped me. – Sherrylsherurd
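As a small extension of the answer above (my own sketch, not part of the original): urlretrieve also accepts a reporthook callback, which is handy for showing progress on large PDFs.

```python
from urllib.request import urlretrieve

def progress(block_num, block_size, total_size):
    # urlretrieve calls this after each block is received;
    # total_size is -1 when the server sends no Content-Length
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print('%.1f%%' % (100.0 * done / total_size))

# usage:
# urlretrieve('http://www.example.com/some_file.pdf',
#             'some_file.pdf', reporthook=progress)
```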

I tried the above code; it works fine in some cases, but some websites with an embedded PDF return an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots: urllib identifies itself with a User-Agent header like Python-urllib/3.3, which gets rejected. So I would suggest sending a custom User-Agent header via the request module of urllib, as shown below.

from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"
# Pretend to be a browser so the server does not block the request
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with urlopen(req) as response, open("<location to dump pdf>/<name of file>.pdf", "wb") as out_file:
    out_file.write(response.read())
Semela answered 25/9, 2018 at 11:27 Comment(1)
This seems to no longer work; it fails with this URL, for example: crsreports.congress.gov/product/pdf/R/R44900/6 – Feltner

I would suggest using the following lines of code:

import urllib.request
import shutil
url = "link to your website for pdf file to download"
output_file = "local directory://name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    # copies the response in chunks, without loading the whole PDF into memory
    shutil.copyfileobj(response, out_file)
Semela answered 2/3, 2018 at 0:18 Comment(0)

FYI: You can also use the wget package to download PDF URLs easily. urllib versions keep changing and often cause issues (at least for me).

import wget

wget.download(link)

Instead of entering the PDF link, you can also modify your code so that you enter a webpage link and extract all PDFs from it. Here's a guide for that: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
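For reference, the scraping idea in that guide can be sketched with the standard library alone (PdfLinkParser and find_pdf_links are my own names, not from the guide):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PdfLinkParser(HTMLParser):
    """Collects every <a href> that ends in .pdf."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith('.pdf'):
                    self.links.append(value)

def find_pdf_links(page_url):
    html = urlopen(page_url).read().decode('utf-8', errors='replace')
    parser = PdfLinkParser()
    parser.feed(html)
    # resolve relative hrefs against the page URL
    return [urljoin(page_url, link) for link in parser.links]
```

Each URL returned by find_pdf_links can then be fed to any of the download snippets above.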

Billon answered 24/12, 2020 at 9:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.