Python urllib getting access denied when browser works

I’m trying to download a CSV file from this site:

http://www.nasdaq.com/screening/companies-by-name.aspx

If I enter this URL in my Chrome browser, the CSV file download starts immediately, and I get a file with data on a few thousand companies. However, if I use the code below, I get an access denied error. There is no login on this page, so what is the Python code doing differently?

from urllib import urlopen

response = urlopen('http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download')
csv = response.read()

# Save the response body to a file (raw string so the backslashes
# in the Windows path are not treated as escape sequences)
with open(r"C:\Users\Ankit\historical.csv", "wb") as f:
    f.write(csv)
Blither answered 25/7, 2014 at 18:13 Comment(1)
Proxy server in the way? - Octofoil

The User-Agent header sent by urllib2 (and, similarly, urllib) is "Python-urllib/2.7" (substitute your own Python version for 2.7).
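
To see which User-Agent your own interpreter sends by default, you can inspect the opener's default headers. This is a minimal sketch that assumes Python 3, where urllib2 was merged into urllib.request; the variable names are only illustrative. It makes no network request.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers urllib attaches to every request by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Prints something like Python-urllib/3.x, depending on the interpreter.
print(default_headers["User-agent"])
```

On Python 2, the equivalent check would be urllib2.build_opener().addheaders.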

You're getting a 403 error because the NASDAQ server doesn't seem to want to send content to this user agent. You can “spoof” the user agent header, and then it downloads successfully. Here’s a minimal example:

import urllib2

DOWNLOAD_URL = 'http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download'

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
req = urllib2.Request(DOWNLOAD_URL, headers=hdr)

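try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Show the error body returned by the server (e.g. the 403 page)
    print e.fp.read()
else:
    # Only read the response if the request succeeded
    content = page.read()
    print content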
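
For readers on Python 3, the same idea can be sketched with urllib.request. This is an assumption-laden adaptation, not part of the original answer: build_request and download are hypothetical helper names, and the User-Agent string is simply the one from the example above. Whether the server still accepts it is not something this sketch can guarantee.

```python
import urllib.error
import urllib.request

# Assumption: a browser-like User-Agent is enough to get past the 403.
# The string is copied from the example above.
BROWSER_UA = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
              '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def download(url, dest_path):
    """Fetch url and write the raw response bytes to dest_path.

    Returns True on success, False if the server answers with an HTTP error.
    """
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            data = response.read()
    except urllib.error.HTTPError as err:
        # err.code is the HTTP status, e.g. 403
        print('Request failed:', err.code, err.reason)
        return False
    # Write bytes unchanged so no str()/strip() clean-up is needed
    with open(dest_path, 'wb') as f:
        f.write(data)
    return True
```

Calling download(DOWNLOAD_URL, 'historical.csv') would then save the CSV to disk, assuming the request succeeds. Writing the response in binary mode avoids the str(csv).strip("b'") workaround from the question.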
Matchwood answered 25/7, 2014 at 18:29 Comment(0)

Alternatively, you can use python-requests, which sends its own default User-Agent header:

import requests

url = 'http://www.nasdaq.com/screening/companies-by-name.aspx'
# Query string parameters; requests builds "?render=download" from this
params = {'render': 'download'}
resp = requests.get(url, params=params)
print(resp.text)
Dacy answered 25/7, 2014 at 18:35 Comment(0)