Download pdf using urllib?

I am trying to download a PDF file from a website using urllib. This is what I've got so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty PDF file. What am I doing wrong?

Ellisellison asked 19/7, 2014 at 20:33 Comment(2)
Probably, you should first check the HTTP response code (getcode()). This might provide some clue. If all is OK at the HTTP level, we have to look elsewhere. Have you tried to download a PDF from another source? Could you provide the real URL of the PDF for testing purposes? – Instillation
To copy to a local file, use urlretrieve – Slipway
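A minimal sketch of the first comment's suggestion (Python 3; the function name and status check are mine, and note that urlopen already raises HTTPError for 4xx/5xx responses, so this is extra insurance):

```python
from urllib.request import urlopen

def download_checked(download_url, filename='some_file.pdf'):
    response = urlopen(download_url)
    # getcode() returns the HTTP status for http(s) URLs;
    # it can be None for other URL schemes, treated as success here
    status = response.getcode()
    if status is not None and status != 200:
        raise RuntimeError('unexpected HTTP status: %s' % status)
    with open(filename, 'wb') as local_file:  # 'wb': PDF data is binary
        local_file.write(response.read())
```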

Here is an example that works:

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    # 'wb' is essential: PDF data is binary
    with open("document.pdf", "wb") as local_file:
        local_file.write(response.read())
    print("Completed")

if __name__ == "__main__":
    main()
Mayan answered 19/7, 2014 at 21:57 Comment(3)
As noted by shockburner, you need to use open("document.pdf", 'wb') – Lesh
This can work in Python 3 as well. All you have to do is change urllib2 to urllib.requests in both locations. – Ophthalmoscopy
*urllib.request (no s) – Ophthalmoscopy
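As the comments note, porting this answer to Python 3 only requires switching to urllib.request; a sketch of that port (the data still has to be written in 'wb' mode):

```python
from urllib.request import urlopen  # urllib2's urlopen lives here in Python 3

def download_file(download_url, filename="document.pdf"):
    response = urlopen(download_url)
    with open(filename, "wb") as local_file:  # 'wb' is still required
        local_file.write(response.read())
    print("Completed")

# usage: download_file("http://mensenhandel.nl/files/pdftest2.pdf")
```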

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'). PDF files are binary, so you need the 'b' flag. This is true of pretty much any file that you can't open in a text editor.
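A quick way to see why the flag matters in Python 3: text mode refuses bytes outright, so downloaded data can never be written correctly without 'b' (file names here are just examples; in Python 2 on Windows, 'w' instead silently translates newline bytes and corrupts the PDF):

```python
data = b'%PDF-1.4 ...'  # a PDF starts with raw bytes like these

with open('demo.pdf', 'wb') as f:  # binary mode accepts bytes
    f.write(data)

try:
    with open('demo.txt', 'w') as f:  # text mode expects str, not bytes
        f.write(data)
except TypeError as e:
    print('text mode raised:', e)
```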

Motheaten answered 19/7, 2014 at 21:15 Comment(4)
I tried it, but it still doesn't work. When I try to open the PDF file, I get an error message saying that the file type is not supported or that the file is damaged. – Ellisellison
Odd, it works for me. Even your original code works for me without the 'b'. Can you download the PDF in a browser and open it normally? If you can, then you should also include your Python and urllib versions with print urllib.__version__. You might also want to try urllib2 instead of urllib. – Motheaten
@Ellisellison Take a look at the file with a web browser. It's most likely an HTML page with a captcha from the file-hosting site you are trying to download from. – Telencephalon
When I clicked the URL I found on a website, it started downloading directly without redirecting me to another page. When I found the link in the HTML and pasted it into a browser, I had to complete a CAPTCHA to download it. – Ellisellison

Try using urllib.request.urlretrieve (Python 3) and just do this:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')
Lombardi answered 5/2, 2018 at 5:52 Comment(2)
from urllib.request import urlretrieve, request has a typo :) – Showmanship
Thanks, it helped me. – Sherrylsherurd
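As a small extension of the answer above (my own sketch, not part of the original): urlretrieve also accepts a reporthook callback, which is handy for showing progress on large PDFs.

```python
from urllib.request import urlretrieve

def progress(block_num, block_size, total_size):
    # urlretrieve calls this after each block is received;
    # total_size is -1 when the server sends no Content-Length
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print('%.1f%%' % (100.0 * done / total_size))

# usage:
# urlretrieve('http://www.example.com/some_file.pdf',
#             'some_file.pdf', reporthook=progress)
```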

I tried the above code; it works fine in some cases, but some websites with an embedded PDF return an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots: urllib identifies itself with a User-Agent header like Python-urllib/3.3, which gets rejected. So I would suggest sending a custom User-Agent header via the request module of urllib, as shown below.

from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"
# Pretend to be a browser so the server does not block the request
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with urlopen(req) as response, open("<location to dump pdf>/<name of file>.pdf", "wb") as out_file:
    out_file.write(response.read())
Semela answered 25/9, 2018 at 11:27 Comment(1)
This seems to no longer work; it fails with this URL, for example: crsreports.congress.gov/product/pdf/R/R44900/6 – Feltner

I would suggest using the following lines of code:

import urllib.request
import shutil
url = "link to your website for pdf file to download"
output_file = "local directory://name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    # copies the response in chunks, without loading the whole PDF into memory
    shutil.copyfileobj(response, out_file)
Semela answered 2/3, 2018 at 0:18 Comment(0)

FYI: You can also use the wget package to download PDF URLs easily. urllib versions keep changing and often cause issues (at least for me).

import wget

wget.download(link)

Instead of entering the PDF link, you can also modify your code so that you enter a webpage link and extract all PDFs from it. Here's a guide for that: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
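For reference, the scraping idea in that guide can be sketched with the standard library alone (PdfLinkParser and find_pdf_links are my own names, not from the guide):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PdfLinkParser(HTMLParser):
    """Collects every <a href> that ends in .pdf."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith('.pdf'):
                    self.links.append(value)

def find_pdf_links(page_url):
    html = urlopen(page_url).read().decode('utf-8', errors='replace')
    parser = PdfLinkParser()
    parser.feed(html)
    # resolve relative hrefs against the page URL
    return [urljoin(page_url, link) for link in parser.links]
```

Each URL returned by find_pdf_links can then be fed to any of the download snippets above.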

Billon answered 24/12, 2020 at 9:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.