HTTP Error 403: Forbidden with urlretrieve

import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf')
    print('initialized')
    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR recieved')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))

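You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. Your urlopen() call sends a browser-like User-Agent, but download_pdf() then calls urllib.request.urlretrieve(), which makes a second request with urllib's default headers. urlretrieve() doesn't allow you to change the HTTP headers. You can, however, use urllib.request.URLopener.retrieve():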

import urllib.request

opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')

N.B. You are using Python 3 and these functions are now considered part of the "Legacy interface", and URLopener has been deprecated. For that reason you should not use them in new code.
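If you want to stay with urllib but avoid the legacy interface, one option is to build a Request with the header and stream the response to disk yourself. This is only a minimal sketch: the function name download_with_user_agent and the default 'Mozilla/5.0' value are made up for illustration, and whether a given User-Agent satisfies this particular server is an assumption.

```python
import shutil
import urllib.request


def download_with_user_agent(url, filename, user_agent="Mozilla/5.0"):
    """Fetch url with a custom User-Agent and stream the body to filename.

    Hypothetical helper: the name and the default user_agent are illustrative.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # urlopen() sends the headers attached to the Request object.
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as outfile:
        # Copy in chunks so the whole file is never held in memory.
        shutil.copyfileobj(resp, outfile)
    return filename


# Example call (performs a network request, so it is left commented out):
# download_with_user_agent(url, "Test.pdf")
```

Unlike the code in the question, this makes a single request, so the header set on the Request is the one the server actually sees when the file is downloaded.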

The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but never uses it. You should, because it is much easier than urllib. This works for me:

import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
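A slightly more defensive variant of the requests approach, as a sketch: it fails loudly on a 403 instead of writing the error page to disk, streams the body in chunks, and lets you pass a User-Agent in case the server rejects the requests default. The function name download_file, the chunk size, and the timeout are illustrative choices, not anything the server requires.

```python
import requests


def download_file(url, filename, user_agent=None, timeout=30):
    """Download url to filename, raising on HTTP errors such as 403.

    Hypothetical helper: name, chunk size and timeout are illustrative.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as r:
        # Raises requests.HTTPError for 4xx/5xx responses.
        r.raise_for_status()
        with open(filename, "wb") as outfile:
            for chunk in r.iter_content(chunk_size=8192):
                outfile.write(chunk)
    return filename


# Example call (performs a network request, so it is left commented out):
# download_file(url, '0580_s03_qp_1.pdf', user_agent='Mozilla/5.0')
```

Calling raise_for_status() before opening the output file means a rejected request leaves no partial or bogus PDF behind.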
