How to fix "403 Forbidden" errors with Python requests even with User-Agent headers?

Asked 15/11, 2022 at 13:55 Answered 30/5, 2023 at 7:36

Solved python python-requests http-status-code-403

I am sending a request to a URL. I converted the curl command to Python, so all the headers are included, but the request does not work: I receive status code 403 and error code 1020 in the HTML output.

The code is

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}

response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)

print(response.status_code)
print(response.cookies.get_dict())
with open("test.html",'w') as f:
    f.write(response.text)

I also get cookies, but not the desired response. I know I can do it with Selenium, but I want to know the reason behind this.

Note:
I have installed all the libraries and checked the versions, but it is still not working and throwing a 403 error.

Underrate answered 15/11, 2022 at 13:55 Comment(11)

The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it. This means that you are still missing something, and it could be anything. You might need specific rights, or your account might not be allowed, or something else. It's hard to say. – Pyrrha 15/11, 2022 at 14:4

I used incognito mode to test the website, and this is the first URL the browser opens. I don't think anything is missing, but if something is, what is it? – Underrate 15/11, 2022 at 14:10

It looks like the site is protected by Cloudflare, which may be using who knows what heuristics. The 403 reply comes from Cloudflare and contains a bunch of JavaScript to redirect the user to the real site once the request passes CF's heuristics. – Ukrainian 15/11, 2022 at 14:13

I have just run your code and it works for me. I just copy-pasted it into a file.py and ran it. – Pyrrha 15/11, 2022 at 14:15

@Pyrrha I will try it on another pc – Underrate 15/11, 2022 at 14:18

@Ukrainian Have you executed it? – Underrate 15/11, 2022 at 14:18

Yes, trying on a different computer won't work either. You need some way to bypass cloudflare. – Ukrainian 15/11, 2022 at 14:21

It also worked on another PC. Then why is it not working on my current PC? Can anyone suggest a fix? – Underrate 15/11, 2022 at 14:21

Maybe this will work: community.cloudflare.com/t/… This guy made some changes to his User-Agent, which allowed it to work again. – Pyrrha 15/11, 2022 at 14:37

@Pyrrha Still not working – Underrate 15/11, 2022 at 14:47

Actually, this worked for me: I just set my own UA value from my original device so that it looks as realistic as I can make it. – Inexecution 30/5, 2023 at 7:33

The site is protected by Cloudflare, which aims to block, among other things, unauthorized data scraping. From "What is data scraping?":

The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in 3 steps:

First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.

When the website responds, the scraper parses the HTML document for a specific pattern of data.

Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.

You can use urllib instead of requests; it seems to be able to deal with Cloudflare:

import urllib.request

# Build the request and attach browser-like headers one by one
req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')

# Fetch the page and save the decoded HTML
r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
    f.write(r)
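
To confirm that a 403 really comes from Cloudflare rather than from the site itself, it can help to look at the response headers. The following is only a minimal sketch: the helper name `looks_like_cloudflare_block` is hypothetical, and it assumes that a Cloudflare response carries a `Server: cloudflare` header or a `CF-RAY` header. It is a heuristic, not an official detection method.

```python
def looks_like_cloudflare_block(status, headers):
    """Guess whether a response is a Cloudflare block page.

    `headers` is any mapping of header names to values; names are
    compared case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: Cloudflare identifies itself via "Server" or "CF-RAY".
    from_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )
    return status == 403 and from_cloudflare
```

With requests, this could be called as `looks_like_cloudflare_block(response.status_code, response.headers)`. With urllib, a 403 raises `urllib.error.HTTPError`, whose `.code` and `.headers` can be passed in the same way. Checking whether `"1020"` appears in the body is a further hint, since that is the error code reported in the question.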

Unremitting answered 4/12, 2022 at 9:28 Comment(6)

I have done it with aiohttp, so the question is not how to do it. The question is why this happens. I will be using urllib as it is much easier for me. I will award you the bounty if no other satisfying answer is given. – Underrate 4/12, 2022 at 11:34

@farhanjatt I edited my answer; basically, the site is protected by Cloudflare, which tries to prevent scraping. – Unremitting 4/12, 2022 at 13:8

I like your answer, but I can run the same script on my computer and Cloudflare allows it, while the same script on my laptop returns error 403. Why the different behaviour? Why doesn't it either run on both or fail on both, given that the library versions are the same? – Underrate 4/12, 2022 at 13:15

@farhanjatt I'm not familiar enough with cloudflare to answer that. I guess it's possible your laptop IP was flagged. – Unremitting 4/12, 2022 at 14:13

It is not only my laptop. It works on some computers but not on others. – Underrate 4/12, 2022 at 14:20

This is a good question and none of the comments address it at all. I can confirm the observation, but have not been able to find a good explanation. – Pibroch 14/4 at 14:41

It works on my machine, so I am not sure what the problem is.

However, when a request does not work, I often check whether it works with Playwright. Playwright uses a browser driver and thus mimics your actual browser when visiting the page. It can be installed using pip install playwright. When you try it for the first time, it may give an error telling you to install the drivers; just follow the instructions to do so.

With playwright you can try the following:

from playwright.sync_api import sync_playwright

url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/69.0.3497.100 Safari/537.36"
)

with sync_playwright() as p:
    # Launch a visible (non-headless) Chromium window
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(user_agent=ua)
    page.goto(url)
    # Give the page a moment to run its JavaScript
    page.wait_for_timeout(1000)

    html = page.content()
    browser.close()

print(html)

A downside of Playwright is that it requires installing Chromium (or another browser). This may complicate deployment, since the browser cannot simply be added to requirements.txt and a container image may be required.

Yoakum answered 2/12, 2022 at 17:34 Comment(1)

As I said, I know about browser automation with Selenium. This does not answer my question. – Underrate 3/12, 2022 at 4:51

Try running Burp Suite's Proxy to see all the headers and other data like cookies. Then you could mimic the request with the Python module. That's what I always do.
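
As a minimal sketch of that last step, the headers and cookies seen in the proxy can be copied into a request object and inspected before anything is sent. The cookie name and value below are placeholders, not values from the real site, and `build_request` is a hypothetical helper; it uses only the standard library's `urllib.request`.

```python
import urllib.request


def build_request(url, headers, cookies):
    """Build a GET request that carries the captured headers and cookies."""
    req = urllib.request.Request(url, headers=headers, method="GET")
    if cookies:
        # Browsers send all cookies in a single "Cookie" header.
        cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
        req.add_header("Cookie", cookie_header)
    return req


# Placeholder values; replace them with what the proxy actually shows.
captured_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0",
    "Accept-Language": "en-US,en;q=0.5",
}
captured_cookies = {"example_cookie": "example_value"}

req = build_request(
    "https://v2.gcchmc.org/book-appointment/", captured_headers, captured_cookies
)

# Inspect what will be sent before making the call.
for name, value in req.header_items():
    print(f"{name}: {value}")

# To actually send it: urllib.request.urlopen(req)
```

Matching headers and cookies may still not be enough, since, as noted in the comments above, Cloudflare can apply other heuristics.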

Good luck!

Forebear answered 6/12, 2022 at 5:52 Comment(0)

The simplest way is to find the request in your browser's DevTools. You can then export it as a Node.js request; I'm not sure about Python.

If Python is not supported, export it into any available language and use an AI tool like ChatGPT to rewrite it in Python.

I'm more of a Node.js dev and am just starting with Python, so this helps me a lot. There is no need for complex tools: use the simplest one, and you will be impressed by what DevTools can give you.
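
For reference, the conversion itself is fairly mechanical. Below is a rough Python sketch that turns a command copied with the browser's "Copy as cURL" option into a URL and a headers dict. `parse_curl` is a hypothetical helper written for illustration; it assumes a simple command containing only the URL and `-H`/`--header` flags, and it ignores everything else.

```python
import shlex


def parse_curl(command):
    """Extract the URL and headers from a simple curl command string.

    Only the URL and -H/--header flags are handled. Other options that
    take a value (for example -X or --data) are not supported here.
    """
    tokens = shlex.split(command)
    url = None
    headers = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            # Split "Name: value" at the first colon only.
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif not token.startswith("-") and url is None:
            url = token
            i += 1
        else:
            i += 1
    return url, headers


command = (
    "curl 'https://v2.gcchmc.org/book-appointment/' "
    "-H 'Accept-Language: en-US,en;q=0.5' -H 'DNT: 1'"
)
url, headers = parse_curl(command)
print(url)
print(headers)
```

The resulting `url` and `headers` can then be passed to `requests.get(url, headers=headers)`, as in the question.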

Inexecution answered 30/5, 2023 at 7:36 Comment(0)
