How to access a site via a headless driver without being denied permission

I am trying to retrieve the HTML of a site using a headless Chrome driver. However, I get a "permission denied" message. If I use a "regular" (non-headless) driver, it all works fine.

Is there any way to bypass that?

It's my first post, so I apologize for any mistakes in formatting.

from selenium import webdriver

# Headless driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')

driver1 = webdriver.Chrome(
    executable_path='./chromedriver',
    options=chrome_options,
    service_args=['--verbose', '--log-path=/tmp/chromedriver.log'],
)

driver1.get('https://www.size.co.uk/')
html = driver1.page_source
html

The message I get is:

<html xmlns="http://www.w3.org/1999/xhtml"><head>\n<title>Access Denied</title>\n</head><body>\n<h1>Access Denied</h1>\n \nYou don\'t have permission to access "http://www.size.co.uk/" on this server.<p>\nReference #18.ac81655f.1548818550.73b12da\n\n\n</p></body></html>

Regular driver:

driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.size.co.uk/')
html = driver.page_source
driver.quit()
html

Ideally, I'd like the output to be the same as in the latter case, without new windows popping up every couple of seconds.

Ruhl answered 30/1, 2019 at 3:32 Comment(0)

Adding in the following code snippet got the page to return for me:

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'    
chrome_options.add_argument('user-agent={0}'.format(user_agent))

The site appears to check for headless browsers (the default headless user agent contains "HeadlessChrome") and deny them access. Here's an article on avoiding detection: Making Chrome Headless Undetectable

To get the user agent being used by the driver you can run the following command:

driver.execute_script("return navigator.userAgent")

Chrome's headless user agent looks something like this:

u'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36'
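As a minimal sketch of how the two user agents relate: assuming the site only keys on the `HeadlessChrome` token (the thread does not confirm this), a regular-looking user agent could be derived from whatever the headless driver reports, rather than hard-coding a Chrome version. The helper name `strip_headless_marker` is hypothetical.

```python
def strip_headless_marker(user_agent: str) -> str:
    """Rename the 'HeadlessChrome' product token to 'Chrome'; leave other strings unchanged."""
    return user_agent.replace("HeadlessChrome/", "Chrome/")


# Headless user agent quoted in the answer above.
headless_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36"
)
regular_ua = strip_headless_marker(headless_ua)
print(regular_ua)
```

The resulting string could then be passed to `chrome_options.add_argument('user-agent={0}'.format(regular_ua))` when creating a new driver.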

Pollute answered 30/1, 2019 at 4:15 Comment(0)

You have to change the user agent in your code.

If you send a lot of requests, you should change the user-agent value on every request. There are many libraries in Python and other languages to help with this. See the link below for how to do it:

Way to change Google Chrome user agent in Selenium?
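One possible way to change the user agent between requests without restarting the browser is the Chrome DevTools Protocol command `Network.setUserAgentOverride`, which Selenium's Chromium drivers expose through `execute_cdp_cmd`. The sketch below is an assumption about how this could be done, not the method from the linked question; the pool contents and the helper name `rotate_user_agent` are illustrative.

```python
from itertools import cycle

# Illustrative pool; both strings appear elsewhere in this thread.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/60.0.3112.50 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/106.0.0.0 Safari/537.36",
]
_pool = cycle(USER_AGENTS)


def rotate_user_agent(driver) -> str:
    """Switch the driver to the next user agent in the pool and return it."""
    user_agent = next(_pool)
    # DevTools Protocol call; only Chromium-based drivers support it.
    driver.execute_cdp_cmd("Network.setUserAgentOverride", {"userAgent": user_agent})
    return user_agent
```

Calling `rotate_user_agent(driver)` before each `driver.get(...)` would cycle through the pool in order.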

Griffe answered 7/9, 2020 at 9:57 Comment(0)

This user agent no longer works on Heroku: user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'

Using this one works:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36
Hochman answered 5/10, 2022 at 17:37 Comment(0)

This is what worked for me in March 2024:

options.addArguments("--headless=new");

instead of:

options.addArguments("--headless");

Try this in combination with the user-agent change suggested by @cullzie
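The snippet above uses the Java bindings. A rough Python equivalent, combining `--headless=new` with the user-agent override, might look like the sketch below. It assumes Selenium 4 with a chromedriver that Selenium can locate on its own; the function names are hypothetical, and `--no-sandbox` is carried over from the question.

```python
def build_chrome_arguments(user_agent: str, new_headless: bool = True) -> list:
    """Return the Chrome command-line arguments for a headless session."""
    headless_flag = "--headless=new" if new_headless else "--headless"
    return [headless_flag, "--no-sandbox", "user-agent={0}".format(user_agent)]


def make_headless_driver(user_agent: str):
    """Start Chrome with the arguments above (requires selenium and Chrome)."""
    # Imported lazily so build_chrome_arguments works without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Usage would be along the lines of `driver = make_headless_driver(user_agent)` followed by `driver.get('https://www.size.co.uk/')`.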

Legato answered 28/3 at 8:12 Comment(0)
