urllib2.HTTPError: HTTP Error 403: Forbidden

I am trying to automate the download of historic stock data using Python. The URL I am trying to open responds with a CSV file, but I am unable to open it using urllib2. I have tried changing the user agent as suggested in a few earlier questions, and I even tried accepting response cookies, with no luck. Can you please help?

Note: The same method works for Yahoo Finance.

Code:

import urllib2,cookielib

site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

hdr = {'User-Agent':'Mozilla/5.0'}

req = urllib2.Request(site,headers=hdr)

page = urllib2.urlopen(req)

Error

File "C:\Python27\lib\urllib2.py", line 527, in http_error_default raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) urllib2.HTTPError: HTTP Error 403: Forbidden

Thanks for your assistance

Electrothermal answered 9/11, 2012 at 6:51 Comment(1)
Are you using Windows as your platform?Darleen

By adding a few more headers I was able to get the data:

import urllib2,cookielib

site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()  # print the error body returned by the server
else:
    content = page.read()
    print content

Actually, it works with just this one additional header:

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
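For reference, a minimal sketch based on that observation (only the User-Agent and Accept headers, same URL) would be the following; whether the Accept header alone is still enough depends on the server's current rules:

import urllib2

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

# only User-Agent and Accept; the extra browser headers above appear to be optional here
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

req = urllib2.Request(site, headers=hdr)
print urllib2.urlopen(req).read()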
Dross answered 9/11, 2012 at 7:19 Comment(14)
Which of these headers do you think was missing from the original request?Crosseye
Wireshark showed that only the User-Agent was sent, along with Connection: close, Host: www.nseindia.com, Accept-Encoding: identityDross
Andrean, thank you very much, it solved the issue. It's unfortunate and funny that I tried all the headers except 'Accept' before posting here.Electrothermal
You're welcome. What I really did was check the URL from your script in a browser, and as it worked there, I just copied all the request headers the browser sent and added them here, and that was the solution.Dross
Thank you!! All of my requests were getting blocked from various forums, and this solved my problem. I think this should definitely be posted along with setting the User-Agent as a solution to the 403 error; This happened to me on numerous sites (I think most of them were running myBB).Dantedanton
@Dross How can I do this in Python 3 with urllib?Evslin
@Mee did you take a look at the answer below? It was addressed specifically for Python 3; check if it works for you...Dross
@Dross I still get this error when I use the solution below. I am trying to get Google PageRank. raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 403: ForbiddenEvslin
Try adding the other headers (from my answer) to the request as well. There are still many other reasons why a server might return a 403; check out the other answers on the topic too. As for the target, Google especially is a tough one, quite hard to scrape; they have implemented many methods to prevent scraping.Dross
I was trying to download a different URL; for that, it worked after removing Connection: Keep-Alive. URL: nseindia.com/content/historical/EQUITIES/2017/FEB/…Lola
I just need the user-agent to replace my previous old one.Unpopular
The code is working on the local but not working on the EC2 instance. Can you help me here?Goffer
This worked in 2021 but now gets a 403 again. Outdated browser or something?Varien
(Looks like the site I'm scraping is now behind Cloudflare, so I need pypi.org/project/cloudscraper)Varien
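If the site has indeed moved behind Cloudflare, as the last comment suggests, a minimal sketch with the cloudscraper package mentioned there (installed via pip; the URL below is a placeholder) might look like this:

import cloudscraper

# cloudscraper exposes a requests-compatible session that handles Cloudflare's challenge pages
scraper = cloudscraper.create_scraper()
response = scraper.get("https://www.example.com/protected-page")  # placeholder URL
print(response.status_code)
print(response.text[:200])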

This will work in Python 3

import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"
headers={'User-Agent':user_agent,} 

request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
data = response.read() # The data you need
Steve answered 24/4, 2013 at 9:9 Comment(1)
It's true that some sites (including Wikipedia) block common non-browser user-agent strings, like the "Python-urllib/x.y" sent by Python's libraries. Even a plain "Mozilla" or "Opera" is usually enough to bypass that. This doesn't apply to the original question, of course, but it's still useful to know.Shu
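To illustrate that comment, here is a small Python 3 sketch that requests the Wikipedia URL from the answer above twice, once with urllib's default "Python-urllib/x.y" user agent and once with a plain browser-like string; the exact status codes depend on Wikipedia's current policy:

import urllib.request
import urllib.error

url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"

for ua in (None, 'Mozilla/5.0'):
    # None means urllib sends its default "Python-urllib/x.y" user agent
    headers = {'User-Agent': ua} if ua else {}
    req = urllib.request.Request(url, headers=headers)
    try:
        code = urllib.request.urlopen(req).getcode()
    except urllib.error.HTTPError as e:
        code = e.code
    print(ua or 'default user agent', '->', code)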

The NSE website has changed, and the older scripts work only partially with the current site. This snippet gathers the daily details of a security: symbol, security type, previous close, open price, high price, low price, average price, traded quantity, turnover, number of trades, deliverable quantity, and the percentage of delivered vs. traded shares. These are conveniently presented as a list of dictionaries.

Python 3.X version with requests and BeautifulSoup

from requests import get
from csv import DictReader
from bs4 import BeautifulSoup as Soup
from datetime import date
from io import StringIO 

SECURITY_NAME="3MINDIA" # Change this to get quote for another stock
START_DATE= date(2017, 1, 1) # Start date of stock quote data DD-MM-YYYY
END_DATE= date(2017, 9, 14)  # End date of stock quote data DD-MM-YYYY


BASE_URL = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol={security}&segmentLink=3&symbolCount=1&series=ALL&dateRange=+&fromDate={start_date}&toDate={end_date}&dataType=PRICEVOLUMEDELIVERABLE"




def getquote(symbol, start, end):
    # Note: "%-d"/"%-m" (no leading zero) is a glibc extension and is not supported
    # by strftime on Windows; use "%d-%m-%Y" there if needed.
    start = start.strftime("%-d-%-m-%Y")
    end = end.strftime("%-d-%-m-%Y")

    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
         'Referer': 'https://cssspritegenerator.com',
         'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
         'Accept-Encoding': 'none',
         'Accept-Language': 'en-US,en;q=0.8',
         'Connection': 'keep-alive'}

    url = BASE_URL.format(security=symbol, start_date=start, end_date=end)
    d = get(url, headers=hdr)
    soup = Soup(d.content, 'html.parser')
    payload = soup.find('div', {'id': 'csvContentDiv'}).text.replace(':', '\n')
    csv = DictReader(StringIO(payload))
    for row in csv:
        print({k:v.strip() for k, v in row.items()})


if __name__ == '__main__':
    getquote(SECURITY_NAME, START_DATE, END_DATE)

Besides that, this is a relatively modular, ready-to-use snippet.

Iny answered 14/9, 2017 at 8:0 Comment(3)
Thanks, man! this worked for me instead of above answer from @DrossEchinate
Hi, I really don't know where to bang my head anymore, I've tried this solution and many more but I keep getting error 403. Is there anything else I can try?Boeotian
A 403 status means the server refuses to authorize your request for this service. It may be that in your case it genuinely requires authentication, e.g. basic auth, OAuth, etc.Iny
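If the 403 really is an authentication issue, as the last comment suggests, a minimal sketch with urllib's basic-auth handler would look something like this (the URL and credentials are placeholders):

import urllib.request

url = "https://example.com/protected/data.csv"  # placeholder endpoint
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "your_user", "your_password")  # placeholder credentials

opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(password_mgr))
with opener.open(url) as response:
    print(response.read()[:200])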

This error usually occurs when the server you are requesting cannot tell where the request is coming from; servers do this to block unwanted visitors. You can bypass the error by defining browser-like headers and passing them along with urllib.request.

Here's the code:

import urllib.request

# define browser-like headers
header= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ' 
      'AppleWebKit/537.11 (KHTML, like Gecko) '
      'Chrome/23.0.1271.64 Safari/537.11',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8',
      'Connection': 'keep-alive'}

# the URL you are requesting (your_url is a placeholder)
req = urllib.request.Request(url=your_url, headers=header) 
page = urllib.request.urlopen(req).read()
Voccola answered 11/3, 2021 at 23:7 Comment(0)

One more thing worth trying is simply updating your Python version. One of my crawling scripts stopped working with a 403 on Windows 10 a few months back. No user agent helped, and I was about to give up on the script. Today I tried the same script on Ubuntu with Python 3.8.5 (64-bit) and it worked with no error. The Python version on Windows was a bit old: 3.6.2 (32-bit). After upgrading Python on Windows 10 to 3.9.5 (64-bit), I don't see the 403 any longer. If you give it a try, don't forget to run 'pip freeze > requirements.txt' first to export your package list. I forgot it, of course. This post is a reminder for me too, for when the 403 comes back in the future.

Victimize answered 13/5, 2021 at 23:20 Comment(0)
import urllib.request

bank_pdf_list = ["https://www.hdfcbank.com/content/bbp/repositories/723fb80a-2dde-42a3-9793-7ae1be57c87f/?path=/Personal/Home/content/rates.pdf",
"https://www.yesbank.in/pdf/forexcardratesenglish_pdf",
"https://www.sbi.co.in/documents/16012/1400784/FOREX_CARD_RATES.pdf"]


def get_pdf(url):
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    
    #url = "https://www.yesbank.in/pdf/forexcardratesenglish_pdf"
    headers={'User-Agent':user_agent,} 
    
    request=urllib.request.Request(url,None,headers) #The assembled request
    response = urllib.request.urlopen(request)
    #print(response.text)
    data = response.read()
#    print(type(data))
    
    name = url.split("www.")[-1].split("//")[-1].split(".")[0]+"_FOREX_CARD_RATES.pdf"
    f = open(name, 'wb')
    f.write(data)
    f.close()
    

for bank_url in bank_pdf_list:
    try: 
        get_pdf(bank_url)
    except Exception:
        pass  # skip banks whose PDF could not be downloaded
Bilocular answered 30/11, 2020 at 11:1 Comment(0)
