urllib2.HTTPError: HTTP Error 403: Forbidden

I am trying to automate the download of historic stock data using Python. The URL I am trying to open responds with a CSV file, but I am unable to open it using urllib2. I have tried changing the user agent as suggested in a few earlier questions, and I even tried accepting response cookies, with no luck. Can you please help?

Note: The same method works for Yahoo Finance.

Code:

import urllib2,cookielib

site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

hdr = {'User-Agent':'Mozilla/5.0'}

req = urllib2.Request(site,headers=hdr)

page = urllib2.urlopen(req)

Error

File "C:\Python27\lib\urllib2.py", line 527, in http_error_default raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) urllib2.HTTPError: HTTP Error 403: Forbidden

Thanks for your assistance

Electrothermal answered 9/11, 2012 at 6:51 Comment(1)
Are you using Windows as your platform?Darleen

By adding a few more headers I was able to get the data:

import urllib2,cookielib

site= "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()  # print the error body returned by the server
else:
    content = page.read()
    print content

Actually, it works with just this one additional header:

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
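For reference, a minimal sketch based on that observation (only the User-Agent and Accept headers, same URL) would be the following; whether the Accept header alone is still enough depends on the server's current rules:

import urllib2

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

# only User-Agent and Accept; the extra browser headers above appear to be optional here
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

req = urllib2.Request(site, headers=hdr)
print urllib2.urlopen(req).read()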
Dross answered 9/11, 2012 at 7:19 Comment(14)
Which of these headers do you think was missing from the original request?Crosseye
Wireshark showed that only the User-Agent was sent, along with Connection: close, Host: www.nseindia.com, Accept-Encoding: identityDross
Andrean, thank you very much, it solved the issue. It's unfortunate and funny that I tried all the headers except 'Accept' before posting here.Electrothermal
You're welcome. What I really did was check the URL from your script in a browser, and as it worked there, I just copied all the request headers the browser sent and added them here, and that was the solution.Dross
Thank you!! All of my requests were getting blocked from various forums, and this solved my problem. I think this should definitely be posted along with setting the User-Agent as a solution to the 403 error; This happened to me on numerous sites (I think most of them were running myBB).Dantedanton
@Dross How can I do this in Python 3 with urllib?Evslin
@Mee did you take a look at the answer below? It was addressed specifically for Python 3; check if it works for you...Dross
@Dross I still get this error when I use the solution below. I am trying to get Google PageRank. raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 403: ForbiddenEvslin
Try adding the other headers (from my answer) to the request as well. There are still many other reasons why a server might return a 403; check out the other answers on the topic too. As for the target, Google especially is a tough one, quite hard to scrape; they have implemented many methods to prevent scraping.Dross
I was trying to download a different URL; for that, it worked after removing Connection: Keep-Alive. URL: nseindia.com/content/historical/EQUITIES/2017/FEB/…Lola
I just need the user-agent to replace my previous old one.Unpopular
The code is working on the local but not working on the EC2 instance. Can you help me here?Goffer
This worked in 2021 but now gets a 403 again. Outdated browser or something?Varien
(Looks like the site I'm scraping is now behind Cloudflare, so I need pypi.org/project/cloudscraper)Varien
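If the site has indeed moved behind Cloudflare, as the last comment suggests, a minimal sketch with the cloudscraper package mentioned there (installed via pip; the URL below is a placeholder) might look like this:

import cloudscraper

# cloudscraper exposes a requests-compatible session that handles Cloudflare's challenge pages
scraper = cloudscraper.create_scraper()
response = scraper.get("https://www.example.com/protected-page")  # placeholder URL
print(response.status_code)
print(response.text[:200])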

This will work in Python 3

import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"
headers={'User-Agent':user_agent,} 

request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
data = response.read() # The data you need
Steve answered 24/4, 2013 at 9:9 Comment(1)
It's true that some sites (including Wikipedia) block common non-browser user-agent strings, like the "Python-urllib/x.y" sent by Python's libraries. Even a plain "Mozilla" or "Opera" is usually enough to bypass that. This doesn't apply to the original question, of course, but it's still useful to know.Shu
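To illustrate that comment, here is a small Python 3 sketch that requests the Wikipedia URL from the answer above twice, once with urllib's default "Python-urllib/x.y" user agent and once with a plain browser-like string; the exact status codes depend on Wikipedia's current policy:

import urllib.request
import urllib.error

url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"

for ua in (None, 'Mozilla/5.0'):
    # None means urllib sends its default "Python-urllib/x.y" user agent
    headers = {'User-Agent': ua} if ua else {}
    req = urllib.request.Request(url, headers=headers)
    try:
        code = urllib.request.urlopen(req).getcode()
    except urllib.error.HTTPError as e:
        code = e.code
    print(ua or 'default user agent', '->', code)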

The NSE website has changed, and the older scripts work only partially with the current site. This snippet gathers the daily details of a security: symbol, security type, previous close, open price, high price, low price, average price, traded quantity, turnover, number of trades, deliverable quantity, and the percentage of delivered vs. traded shares. These are conveniently presented as a list of dictionaries.

Python 3.X version with requests and BeautifulSoup

from requests import get
from csv import DictReader
from bs4 import BeautifulSoup as Soup
from datetime import date
from io import StringIO 

SECURITY_NAME="3MINDIA" # Change this to get quote for another stock
START_DATE= date(2017, 1, 1) # Start date of stock quote data DD-MM-YYYY
END_DATE= date(2017, 9, 14)  # End date of stock quote data DD-MM-YYYY


BASE_URL = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol={security}&segmentLink=3&symbolCount=1&series=ALL&dateRange=+&fromDate={start_date}&toDate={end_date}&dataType=PRICEVOLUMEDELIVERABLE"




def getquote(symbol, start, end):
    # Note: "%-d"/"%-m" (no leading zero) is a glibc extension and is not supported
    # by strftime on Windows; use "%d-%m-%Y" there if needed.
    start = start.strftime("%-d-%-m-%Y")
    end = end.strftime("%-d-%-m-%Y")

    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
         'Referer': 'https://cssspritegenerator.com',
         'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
         'Accept-Encoding': 'none',
         'Accept-Language': 'en-US,en;q=0.8',
         'Connection': 'keep-alive'}

    url = BASE_URL.format(security=symbol, start_date=start, end_date=end)
    d = get(url, headers=hdr)
    soup = Soup(d.content, 'html.parser')
    payload = soup.find('div', {'id': 'csvContentDiv'}).text.replace(':', '\n')
    csv = DictReader(StringIO(payload))
    for row in csv:
        print({k:v.strip() for k, v in row.items()})


if __name__ == '__main__':
    getquote(SECURITY_NAME, START_DATE, END_DATE)

Besides that, this is a relatively modular, ready-to-use snippet.

Iny answered 14/9, 2017 at 8:0 Comment(3)
Thanks, man! this worked for me instead of above answer from @DrossEchinate
Hi, I really don't know where to bang my head anymore, I've tried this solution and many more but I keep getting error 403. Is there anything else I can try?Boeotian
A 403 status means the server refuses to authorize your request for this service. It may be that in your case it genuinely requires authentication, e.g. basic auth, OAuth, etc.Iny
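If the 403 really is an authentication issue, as the last comment suggests, a minimal sketch with urllib's basic-auth handler would look something like this (the URL and credentials are placeholders):

import urllib.request

url = "https://example.com/protected/data.csv"  # placeholder endpoint
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "your_user", "your_password")  # placeholder credentials

opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(password_mgr))
with opener.open(url) as response:
    print(response.read()[:200])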

This error usually occurs when the server you are requesting cannot tell where the request is coming from; servers do this to block unwanted visitors. You can bypass the error by defining browser-like headers and passing them along with urllib.request.

Here's the code:

import urllib.request

# define browser-like headers
header= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ' 
      'AppleWebKit/537.11 (KHTML, like Gecko) '
      'Chrome/23.0.1271.64 Safari/537.11',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8',
      'Connection': 'keep-alive'}

# the URL you are requesting (your_url is a placeholder)
req = urllib.request.Request(url=your_url, headers=header) 
page = urllib.request.urlopen(req).read()
Voccola answered 11/3, 2021 at 23:7 Comment(0)

One more thing worth trying is simply updating your Python version. One of my crawling scripts stopped working with a 403 on Windows 10 a few months back. No user agent helped, and I was about to give up on the script. Today I tried the same script on Ubuntu with Python 3.8.5 (64-bit) and it worked with no error. The Python version on Windows was a bit old: 3.6.2 (32-bit). After upgrading Python on Windows 10 to 3.9.5 (64-bit), I don't see the 403 any longer. If you give it a try, don't forget to run 'pip freeze > requirements.txt' first to export your package list. I forgot it, of course. This post is a reminder for me too, for when the 403 comes back in the future.

Victimize answered 13/5, 2021 at 23:20 Comment(0)
import urllib.request

bank_pdf_list = ["https://www.hdfcbank.com/content/bbp/repositories/723fb80a-2dde-42a3-9793-7ae1be57c87f/?path=/Personal/Home/content/rates.pdf",
"https://www.yesbank.in/pdf/forexcardratesenglish_pdf",
"https://www.sbi.co.in/documents/16012/1400784/FOREX_CARD_RATES.pdf"]


def get_pdf(url):
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    
    #url = "https://www.yesbank.in/pdf/forexcardratesenglish_pdf"
    headers={'User-Agent':user_agent,} 
    
    request=urllib.request.Request(url,None,headers) #The assembled request
    response = urllib.request.urlopen(request)
    #print(response.text)
    data = response.read()
#    print(type(data))
    
    name = url.split("www.")[-1].split("//")[-1].split(".")[0]+"_FOREX_CARD_RATES.pdf"
    f = open(name, 'wb')
    f.write(data)
    f.close()
    

for bank_url in bank_pdf_list:
    try: 
        get_pdf(bank_url)
    except Exception:
        pass  # skip banks whose PDF could not be downloaded
Bilocular answered 30/11, 2020 at 11:1 Comment(0)
